Thursday, November 2nd, 2023
Open source tools developed by the Pangeo ML community are enabling the shift to cloud-native geospatial Machine Learning. Join the Pangeo ML community working towards scalable GPU-native workflows! 🚀
At FOSS4G SotM Oceania 2023 last month, we presented on "The ecosystem of geospatial machine learning tools in the Pangeo World" (see the recording here). One of the driving forces of the Pangeo community is to build better tools that will enable scientific workflows on petabyte-scale datasets, such as Climate/Weather projections that will impact the planet over the coming decades.
To do that, we need to be fast.
These next-generation tools need to be scalable, efficient, and modular. So we are designing them with three aspects in mind: GPU-native computation, streaming subsets of data on-the-fly, and multi-modal models.
Neither of the first two core technologies is particularly new. NVIDIA has been leading the development of GPU-native RAPIDS AI libraries since 2018. Streaming has been around since the 2010s if not earlier, and is practically the most common way to consume music and video content nowadays. More recently, we have also seen the rise of multi-modal Foundation Models that are able to take in visual (image) and language (text and sound) cues.
Let's now take a step back, and picture what we're working with.
There are three main layers to a Machine Learning data pipeline: data storage file formats at the bottom, an in-memory array representation in the middle, and high-level libraries and documentation resources that users and developers interact with at the top.
The key to connecting all of these layers are open standards.
For the file formats, we favour cloud-native geospatial formats because they allow us to efficiently access subsets of data without reading the entire file. Generally speaking, you would store rasters as Zarr or Cloud-Optimized GeoTIFFs, and vectors (points/lines/polygons) as FlatGeobuf or (Geo)Parquet. Ideally though, these files would be indexed using a SpatioTemporal Asset Catalog (STAC), which makes it easier to discover datasets using standardized queries. This can be a whole topic in itself, so check out this guide that was published last month for more details!
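To make this concrete, here is a minimal sketch of what cloud-native access looks like: query a public STAC API, then read just a low-resolution overview of one Cloud-Optimized GeoTIFF band instead of downloading the full file. The Earth Search endpoint, the sentinel-2-l2a collection, and the bounding box are illustrative choices, not prescriptions.

```python
import pystac_client
import rioxarray

# Search a public STAC API (endpoint and collection are example choices)
catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v1")
search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[174.6, -37.0, 174.9, -36.8],  # example area of interest
    datetime="2023-10-01/2023-10-31",
    max_items=1,
)
item = next(search.items())

# Read only a decimated overview of the red band from the
# Cloud-Optimized GeoTIFF, rather than fetching the entire file
red = rioxarray.open_rasterio(item.assets["red"].href, overview_level=2)
print(red.shape)
```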
In the Python world, NumPy arrays have been the core way of representing arrays in-memory, but there are many others too, along with an ongoing movement to standardize the array/dataframe API at https://data-apis.org. Geospatial folks will most likely be familiar with vector libraries like GeoPandas, whose GeoDataFrames are built on top of pandas; or raster libraries like rioxarray and stackstac that read data into xarray data structures.
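As a sketch of how these pieces snap together, the snippet below stacks STAC search results into a single lazy xarray DataArray using stackstac; as before, the endpoint, collection and resolution are illustrative assumptions.

```python
import pystac_client
import stackstac

catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v1")
items = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[174.6, -37.0, 174.9, -36.8],
    datetime="2023-10-01/2023-10-31",
).item_collection()

# Build one lazy (time, band, y, x) DataArray backed by dask chunks;
# pixels are only fetched over the network when something is computed
da = stackstac.stack(items, resolution=100)
print(da.sizes)
```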
NumPy arrays are CPU-based, but there are also libraries like CuPy which can do GPU-accelerated computations. Instead of GeoPandas, you could use libraries like cuSpatial (built on top of cuDF and part of RAPIDS AI) to run GPU-accelerated spatial algorithms. Deep Learning libraries like PyTorch, TensorFlow or JAX tend to be GPU-based as well, while libraries like Datashader (for visualization) and Xarray are designed to be CPU/GPU agnostic and can hold either kind of array.
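To give a feel for how little code changes when swapping the array backend, here is a minimal sketch that puts CuPy arrays inside an xarray DataArray (this assumes a CUDA-capable GPU with cupy installed):

```python
import cupy as cp
import numpy as np
import xarray as xr

# Start with an ordinary NumPy-backed DataArray ...
da = xr.DataArray(np.random.rand(4, 512, 512), dims=("time", "y", "x"))

# ... then move the payload to GPU memory; the xarray API stays the same
da_gpu = da.copy(data=cp.asarray(da.data))

# Reductions now dispatch to CuPy kernels instead of NumPy
mean_map = da_gpu.mean(dim="time")
print(type(mean_map.data))  # <class 'cupy.ndarray'>
```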
Finally, to make life simpler, we have high-level convenience libraries wrapping the low-level stuff. The Pangeo Machine Learning Working Group mostly works on Climate/Weather datasets, so we'll focus on multi-dimensional arrays for now.
In particular, three packages the community has been working on are:
- kvikIO: exposed via cupy-xarray, this enables reading Zarr stores directly into GPU memory using NVIDIA's GPUDirect Storage, skipping the usual round-trip through CPU memory.
- xbatcher: slices large multi-dimensional xarray datasets into smaller batches that are ready to feed into a Machine Learning model.
- zen3geo: composable geospatial data pipelines built on torchdata DataPipes, for chaining together I/O, preprocessing and batching steps.
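As a quick taste of one of these, here is a small sketch of slicing a synthetic (made-up) dataset into 64x64 training chips with xbatcher:

```python
import numpy as np
import xarray as xr
import xbatcher

# A synthetic stand-in for a real Climate/Weather dataset
ds = xr.Dataset(
    {"temperature": (("time", "y", "x"), np.random.rand(100, 256, 256))}
)

# Yield 64x64 spatial chips; dimensions not listed in input_dims
# (here, time) are kept whole in every batch
bgen = xbatcher.BatchGenerator(ds, input_dims={"y": 64, "x": 64})
for batch in bgen:
    print(batch["temperature"].shape)  # (100, 64, 64)
    break
```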
Educational resources:
Pangeo ML Working Group:
The work above is the cumulative effort of folks from the Pangeo, Xarray and RAPIDS AI communities, plus more! In particular, we'd like to acknowledge Deepak Cherian at Earthmover and Negin Sobhani at NCAR for their work on cupy-xarray/kvikIO, Max Jones at Carbonplan for recent developments on the xbatcher package, and Wei Ji Leong at Development Seed for the development of zen3geo.