2022 High Throughput Imaging Characterization of Brain Cell Types & Connectivity-Day 2, Part 3
YONG YAO: Let's start the afternoon session, session three, on data management. It's my pleasure to introduce Simon Ilyushchenko. Simon is the data ingestion lead for Google Earth Engine and has worked on the Google Earth Engine product since its inception in 2009. Earth Engine is a planetary-scale geospatial platform that helps manage global natural resources and address climate change. As of now, about 70 petabytes of public imaging data about the planet Earth have been collected and used for research and education. The images have spatial resolution from 60 centimeters through hundreds of kilometers, spanning six orders of magnitude just like brain imaging, which runs from electron microscopy at nanometer resolution through MRI at millimeter resolution, albeit at the other end of the resolution spectrum.
The Earth Engine data also have temporal granularity from minutes through days and years. This presentation will discuss how the planetary imaging data are ingested, processed, and analyzed. I asked Simon to also discuss current challenges in managing and using petabyte-scale imaging data and how to improve the user experience. I believe his presentation on Google Earth Engine will provide an inspiring example for BICAN data management and stimulate the session three panel discussion. Without further ado, take it away.
SIMON ILYUSHCHENKO: Thank you, Yong. Thank you for inviting me to this meeting. I'm happy to be here. Can you see my slides?
YONG YAO: Yes.
SIMON ILYUSHCHENKO: Excellent. So as you mentioned, I work as the data ingestion lead on Google Earth Engine, which means I run, build, and maintain the pipelines which ingest data. And for my opening screen, I picked a few images from our catalog which are geospatial images but which look like something you might be able to see in nature just to maybe emphasize some data connections between us. The structure of the talk is the following: I will talk very, very briefly about what Earth Engine does. There are links there, and you can follow the link below if you want to see the presentation yourself. I will talk a bit about the Earth Engine internals. How it's implemented.
I will try to allocate most of the time for general discussion of lessons and challenges, and I will leave a few minutes for questions. I will also be at the last panel at the end, so I can address some long questions there. To address any possible misconceptions, Earth Engine is not the same as Google Earth. Google Earth is just the viewer. Earth Engine is a platform for scientists and developers to write geospatial applications. So it's a much more complex computational platform, something like MATLAB maybe.
Our focus is not just to build a geospatial platform. We specifically try to primarily address use cases around climate change and natural resources management in general. A lot of our team, including me, joined the product to work on these problems specifically. Just to give you a taste of what Earth Engine can be used for, this is an animated timelapse of satellite imagery spanning decades. If you Google "Earth Engine Timelapse" you will see it's available online. So this is not just a static video. It's a zoomable video where, like on Google Maps, you can go in and out and see data at different scales.
Very brief history. We started the project in 2009. I'm not the founder, but I'm one of the first data people who joined the team. We launched in 2010. Until this year we were primarily a platform for researchers and the academic community, but we have also started adding commercial customers, and this year we became a proper Google Cloud API. This shows an advanced ML product we created this year. This is a land cover map, meaning that each pixel color represents a different type of land cover: forest, urban, water, and so on. The cool thing about this is that it's not just one static map. It's produced for each new image coming from Sentinel-2 within 12 hours after observation. And we are not showing just one land cover class. We're showing probabilities of all 10 land cover classes. It's a whole big discussion. It's a very cool product. Follow the links if you want to read more.
So this is what I'm in charge of, the public data catalog. Right now we have more than 70 petabytes of public data in almost 1,000 data sets. By size it's mostly satellite imagery, but we also have terrain data, elevation profiles. We have land cover. We have atmospheric data and so on. The limitation we have is that we only work with two-dimensional rasters and vectors, so we cannot do volumetric computations, but even this is very useful for a number of users. Just some academic credentials: we have been mentioned in a number of Nature and Science articles, including several that our team co-authored. We have 26 case color results, so it's been successful in this way. Of course our goal is not just to help publish papers, but also to help make some change in the real world. An interesting challenge we are facing is how to actually define our impact metrics, but that's also a whole different story.
A few links to applications. I'm not going to go over them in detail. Please follow the links if you want to read more. So basically this is what you would imagine Google would do if it took all its compute power and tried to apply it to remote sensing problems and climate change problems in general. A brief overview of how computations in Earth Engine work. We often start with specific images, like this one satellite image taken for a particular location. Then you pretty often have to do some kind of processing, either to compute some simple band arithmetic or to apply some more complex algorithms like feature detection, edge detection, and so on.
The power of Earth Engine comes with scale. Typically image collections contain a number of very similar images, which means we can treat them as a big list, sort of like filter, map, reduce in Python or other languages. So we take a big collection. We filter it to just the images we care about, for example, images with low cloud cover. We map the algorithm over this collection, then we reduce it to, for example, a single image using something like a median reducer, a minimum, maximum, or percentile reducer, and so on. So we get some kind of mosaic representing this whole computation. And finally we might want to create some kind of derived statistics. We might want to export images, export some computed polygons, and so on. Our integration with other tools is not great, but right now it's good enough for many workflows.
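The filter, map, reduce pattern described above can be sketched in plain Python. This is only an illustration and not Earth Engine code: the toy `collection`, the `cloud_cover` field, the 0.2 threshold, and the `ndvi` and `median_composite` helpers are all assumptions made up for the example, and the band values are small integers so the arithmetic is easy to follow.

```python
from statistics import median

# Hypothetical toy "images": a cloud-cover fraction plus tiny 2x2 grids of
# red and near-infrared (nir) band values. These names are illustrative only.
collection = [
    {"cloud_cover": 0.05, "red": [[1, 1], [1, 1]], "nir": [[3, 3], [3, 3]]},
    {"cloud_cover": 0.10, "red": [[1, 1], [1, 1]], "nir": [[1, 1], [1, 1]]},
    {"cloud_cover": 0.15, "red": [[1, 1], [1, 1]], "nir": [[7, 7], [7, 7]]},
    {"cloud_cover": 0.90, "red": [[5, 5], [5, 5]], "nir": [[5, 5], [5, 5]]},
]


def ndvi(image):
    """Map step: simple band arithmetic, (nir - red) / (nir + red), per pixel."""
    return [
        [(n - r) / (n + r) for r, n in zip(red_row, nir_row)]
        for red_row, nir_row in zip(image["red"], image["nir"])
    ]


def median_composite(grids):
    """Reduce step: per-pixel median across a list of equally sized grids."""
    rows, cols = len(grids[0]), len(grids[0][0])
    return [
        [median(g[y][x] for g in grids) for x in range(cols)]
        for y in range(rows)
    ]


clear = [img for img in collection if img["cloud_cover"] < 0.2]  # filter
indexed = [ndvi(img) for img in clear]                           # map
composite = median_composite(indexed)                            # reduce
```

Here the cloudy fourth image is dropped by the filter, and `composite` is a single 2x2 grid holding the per-pixel median of the three remaining index images.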
Something that we built the product around is the paradigm of moving the questions to the data, just because remote sensing data is so big. 70 petabytes is nothing. We are looking at hundreds of petabytes and probably exabytes coming from different sources in the next 5 to 10 years. So we have to be ready for that. And so far the easiest way to do this was to just run our computation jobs very close to the data. We are actually reaching some limits around this, which I'll talk about later. But for now, this has been very successful.
One property of geospatial data, which makes it easy for us to build this product and might not apply to the same degree to your domain, is data locality. Very often if you have a time series of images, you can run a particular computation, like a median and so on, just for each individual pixel in the time series, or at worst maybe for a small neighborhood around this pixel. Which means it's almost embarrassingly parallel, which is a computing term. We can break computations into fairly small chunks, which can be sent to different servers, and then we can wait for them to be recombined. This isn't always true. An exception is, for example, if you look at river drainage basins. When you look, for example, at the mouth of a river and you want to find all the pixels which drain into that particular pixel in the mouth of the river, then sometimes, if it's the Mississippi, you have to go all the way across the continent. And this is not a local computation, but most computations are local.
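As a rough sketch of why per-pixel locality makes this kind of work easy to parallelize, the example below splits a small time series into row strips, computes the temporal median of each strip independently, and stitches the strips back together. It is an assumption-laden toy: threads stand in for separate servers, and `temporal_median`, `split_rows`, and `chunked_temporal_median` are hypothetical names, not part of any Earth Engine API.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import median


def temporal_median(stack):
    """Per-pixel median over time for a list of equally sized 2-D grids."""
    rows, cols = len(stack[0]), len(stack[0][0])
    return [
        [median(img[y][x] for img in stack) for x in range(cols)]
        for y in range(rows)
    ]


def split_rows(stack, chunk_rows):
    """Cut every image in the stack into horizontal strips of chunk_rows rows."""
    rows = len(stack[0])
    return [
        [img[start:start + chunk_rows] for img in stack]
        for start in range(0, rows, chunk_rows)
    ]


def chunked_temporal_median(stack, chunk_rows=2, workers=4):
    """Process each strip independently, then recombine in the original order."""
    chunks = split_rows(stack, chunk_rows)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(temporal_median, chunks))  # map keeps input order
    return [row for strip in partial for row in strip]


# Three 4x2 "images" of the same place taken at different times.
stack = [
    [[1, 2], [3, 4], [5, 6], [7, 8]],
    [[2, 3], [4, 5], [6, 7], [8, 9]],
    [[9, 1], [1, 9], [0, 0], [5, 5]],
]
result = chunked_temporal_median(stack, chunk_rows=2)
```

Because each output pixel depends only on the same pixel in the other images, the chunked result matches the unchunked one for any strip size. A drainage-basin computation would not split this way, since one output pixel can depend on pixels far outside its strip.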
Our architecture is in general what you would expect. We have clients. You can either write Earth Engine code in the web browser, or you can use it from Python or other desktop clients. But the actual computations run on the server, and clients communicate with the backend via particular APIs. We obviously store the information somewhere. The most interesting part, and this is described partially in the paper I'm linking to, is the computation in the middle. We had to build our own computation distribution framework, which is specifically oriented toward geospatial data. And so we know how to break down computations, shard them, and then bring the results back. We have two types of operation. For computations which are relatively small and take under five minutes, we can just compute results interactively, which is very convenient for just browsing something in our code editor and just experimenting. So it's one of the most powerful features. But of course some computations are much larger. And so we also use the same exact code to start batch tasks, which can run for hours or days.
Another very convenient thing that we can take advantage of, which might or might not be applicable in your domain, is image pyramiding. If you've used Google Maps or similar products, you've probably seen that there are tiles of the same size at different scales, and if you zoom in and out, different tiles get shipped to your client so that you can observe data at multiple zoom levels. So when the image comes in, it's typically distributed just at the native zoom level. And then before we make the image available, we do what's called pyramiding. We take four neighboring pixels, compute the average, or maybe the mode if it makes more sense, and then create an image which is half the size in each dimension. And then we keep going until we reach a small tile. And this tile is useful if you want to look, for example, at the whole world.
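The pyramiding step described above can be sketched as follows. This is a minimal illustration, assuming a square image whose side is a power of two and using the average of each 2x2 block; `downsample` and `build_pyramid` are hypothetical helper names, and for categorical data such as land cover classes the average would be swapped for the mode.

```python
def downsample(image):
    """Average each 2x2 block of pixels into one pixel, halving both dimensions.

    Assumes the image has an even number of rows and columns.
    """
    rows, cols = len(image), len(image[0])
    return [
        [
            (image[y][x] + image[y][x + 1]
             + image[y + 1][x] + image[y + 1][x + 1]) / 4
            for x in range(0, cols, 2)
        ]
        for y in range(0, rows, 2)
    ]


def build_pyramid(image, min_size=1):
    """Return [native level, half size, quarter size, ...] down to min_size per side."""
    levels = [image]
    while len(levels[-1]) > min_size and len(levels[-1][0]) > min_size:
        levels.append(downsample(levels[-1]))
    return levels


# A 4x4 image at its native resolution.
native = [
    [1, 3, 5, 7],
    [1, 3, 5, 7],
    [2, 2, 6, 6],
    [2, 2, 6, 6],
]
pyramid = build_pyramid(native)
```

For this input the pyramid has three levels of 4x4, 2x2, and 1x1 pixels. A viewer zoomed far out would read the small top level instead of the full-resolution data.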
So now a bit more detail about our storage. We use an internal storage system which is cheap and fast. But unfortunately, it means if somebody else wanted to recreate the system outside of Google, this wouldn't be so easy. So as I mentioned, we use pyramided files, and actually now there is an open industry standard called COGs, Cloud Optimized GeoTIFFs, which actually ships the data with this pyramid already precomputed, which is convenient. We employ lots and lots of caching at several layers, including an SSD cache, which is very heavily used. And our storage is so interesting that we actually cannot just use the standard internal Google tools. We have to write a little bit of our own to make sure that, for example, moving data from cluster to cluster is relatively efficient.
All right. Now, what can we recommend for people in your domain? So as I mentioned, we are trying to run computations close to the data. And after 12 years of doing this, we are starting, even within Google, to run into internal compute limits. It's not easy to just say, "Hey, I want 200 petabytes in a nearby cluster because I need to spin up a new replica." You have to be kind of careful about selecting new locations. And so the bad news is, we wish we had some recommendations on how to do this distribution of data and computations in general. Right now we don't. We are still figuring this out. The good news is, we are forced to figure this out. And so maybe in a couple of years, if you come back to us, we might have some more general recommendations which apply to other computation types, not just spatial computations.
Obviously, at this scale, you have to be careful about your platform choice and tool choice, because once you commit to a particular cloud, you might be stuck with it. Of course people are aware of this, and they are trying to use some kind of multi-cloud infrastructure. This is pretty hard, but this is something we have to deal with in the modern world of cloud computing. And to kind of confirm my points, just a few days ago I was watching a talk by Peter Norvig, a prominent ML/AI scientist. And he also mentioned that it's not so much the ML. Interestingly, just wrangling the data that goes into the ML pipelines is the hardest part, even for them.
Our current user experience is not great, and I think many platforms which try to do a lot of things will run into similar problems. Our interoperability leaves something to be desired. In general, right now there is no good standard in the open-source world for these kinds of computations, so next year we are trying to improve how we import and export data to be combined with other tools. We are not open source. It would be super hard to make our backend open source, though sometimes we'd like to, just because we are tied too much into the internal Google tools. And because it's a new coding environment, it's very hard to debug scripts, especially when they are parallel. If you've written any parallel code, you probably know that it's not just a simple matter of print-statement debugging. Sometimes you run into very weird bugs, and it's difficult even for us. So we are trying to give users enough tools to do this better. This is also one of our main focuses next year. And one particular subtype: if your backend tasks run out of memory, then this is also very hard to debug.
But even forgetting about particular platform peculiarities, when we run something like this at large scale, some problems come up which everybody has to address. Correctness is a problem. The data might have bugs. Computations might have bugs. And at large scale they might be hard to detect because they might only show up in some edge cases, like a polygon which is not quite the whole world but about half of the world, and so on. So this is always something we have to keep an eye on. Reproducibility is an interesting goal. Coming from academic environments, users correctly expect us to produce results which are reproducible, but realistically, because underlying data change, underlying algorithms change, and we fix bugs, it's not actually always easy to just go and reproduce something. So we find this is not just a binary zero-or-one thing. This is actually more of a spectrum. We can try to make things reproducible, but the further we push, the harder it is.
Just dealing with large data sets is actually pretty challenging. Again, you would imagine that a data set is a chunk which, once it's ingested, is just present. It really is more like a living entity. You have lots of images. They might have different processing because the data set has been ingested over months, or maybe even years. Some images might be broken. Some images might be in a strange state. So a lot of our time is spent just trying to make sense of what's happening with the data. Curation is obviously a problem. There are many more data sets than we can ingest, so we need to be thoughtful about how we listen to the users and how we choose what we should ingest next and what features we need to build into the platform. If it's more than just a one-time ingestion, if it's something that's produced every day or every week, then it means somebody has to run the code to ingest this, which means somebody needs to alter this pipeline. Should it be us? Should it be users? Who pays for it? It always opens a lot of questions.
Versioning in particular is a problem because even if, for example, some satellite images have been fully reprocessed two or three times, sometimes people are stuck using older versions of the data because that's what their code is hardcoded to use. At some point, for example, we were making a breaking change in the client. And somebody said to us, "Hey, I have this seven-year-old client which I can't really upgrade. Please don't do this." So we said, "I'm sorry, we have to upgrade this." Seven years is too much.
At an even higher level, and this is probably more familiar to you, we're trying to solve the mental model, but especially at the cutting edge of science, there are no common mental models. Everybody's doing something different. So if you compute precipitation, what kind of precipitation is this? What models have you used? Which area is it over? Is it over a day, is it over a week, and so on. And obviously we often have to question terms. What is a forest? There's no one definition of forest. How tall are the trees? How far apart are they? How do we make sure that you don't count oil palm plantations as forests, because they are not? This is bad for biodiversity. And just annotating all this, I would say we haven't even really started, because this basically means creating an ontology, creating a lot of documentation, and so these are very hard open questions.
So this is probably the most important slide. I tried to collect the information that we would give as advice. Some of our lead developers said that number one, "don't generalize too early," is very important. So I would say, try to collect particular use cases. Try to generalize just enough to address those use cases. Don't go too far, because most likely some generalizations will be wrong. Obviously, listen to user feedback as early as possible. Make sure people can discuss things among themselves or with you. Provide some kind of forums for them to share experience.
One thing I should mention, Earth Engine is popular because it harmonizes data. Instead of dozens of different formats, we present data to be as similar as possible even though they're coming from different sources. Obviously, there are limits, but just seeing all the collections from various satellites as a similar type of collection is very powerful. Maintenance obviously becomes a problem the bigger the platform is. Big data means big errors. Something always goes wrong, and so software and data maintenance has to be budgeted for. QA checks have to be run as much as possible, and even then, things will be missed. So there should be manual spot checks.
A few links to similar systems. I won't go over them, but follow them if you want. Probably nobody is at the exact same level as we are, but people are obviously building platforms in the same direction. And a few general terms that are emerging are data cube and analysis-ready data. STAC is a new open standard for describing geospatial data. And yeah, in general I would recommend reaching out to the Earth sciences community. I just came from a conference called AGU, which is still running in Chicago this week. And there are several sessions and tracks talking about data management, ontologies, and so on. So I will stop here, and if you have any quick questions I will take a couple.
YONG YAO: Maybe you have time for one quick question?
SIMON ILYUSHCHENKO: Go ahead.
YONG YAO: So in the absence of a quick question from everyone else, I have one which I think has already been addressed in the chat, which is that what we've been talking about is sort of anatomical data, but in the pathology and radiology community, it's not only visualization. It's also generating AI-based diagnostic systems. So is there anything that you said in your discussion that is or isn't relevant to diagnostic applications for large-scale machine learning?
SIMON ILYUSHCHENKO: So I don't know anything about diagnostics. I can talk about AI. So we have some built-in AI capabilities, but we also connect to AI platforms. I would say they make especially provenance and explainability even harder, because if there's just simple math, you can go and simply debug this math. If you have an AI model, I think that's one of the biggest challenges for the AI community in general: how do you explain what your model does? Is it even correct? And obviously, AI is much harder to incorporate. So all those challenges kind of double or quadruple.
YONG YAO: Thank you.