My Geospatial Data Analysis Journey - Part 1

Blood, sweat, and tears along with the excitement of discovering new things

Posted by Tanya Dixit on April 10, 2022 · 3 mins read

The key to good geospatial data analysis is understanding the data. Did that come as a surprise? Just kidding! Data is KING!

Yes, understanding data is essential. I consider myself decent at exploratory data analysis, but when I started looking at the various geospatial data sources my team had lined up, I was confused, to say the least, and in reality f***ing overwhelmed.

What I did to make some sense of this data may come as no surprise to those who have been through the same journey.

I started writing code!!

I started with GDAL, the geospatial library that everyone uses. And I started with the easiest format - GeoTIFF.
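For a sense of how little ceremony GDAL needs for a GeoTIFF, here's roughly where I started. A minimal sketch - the file name is a stand-in, not one of my actual sources:

```python
from osgeo import gdal

ds = gdal.Open("elevation.tif")  # hypothetical GeoTIFF
print(ds.RasterXSize, ds.RasterYSize, ds.RasterCount)  # width, height, band count
print(ds.GetProjection()[:80])   # coordinate reference system, as WKT
print(ds.GetGeoTransform())      # pixel-to-map-coordinate mapping
```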

I should give you some idea of what my end goal in this exercise is - I want to create a fusion engine that combines various disparate sources of data so that they can be fed directly into a model that maps bushfire risk spatially. The task sounds harmless, but with the sheer number of data formats I am dealing with, and ~25 data sources, let's just say it's not the kind of straightforward data loader task I had gotten pretty good at.

I broke this problem down into a few abstractions that I started implementing. The first and most important one is getting all the data into the same projection system. That one seemed easy but still took me some time to implement, though it was relatively straightforward with GDAL. And since I was only writing this for GeoTIFF at this point, it was mostly loading the GeoTIFF file, reading the bands, converting the data to the destination projection, and writing the bands one by one (yes, you have to write band by band in a "for loop"). A sketch of that flow is below.
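Here is a minimal sketch of that band-by-band approach. The file names and the WGS84 (EPSG:4326) target are illustrative assumptions, not my actual pipeline:

```python
from osgeo import gdal, osr

src = gdal.Open("input.tif")  # hypothetical source GeoTIFF

# Build the target spatial reference (assuming WGS84 / EPSG:4326).
dst_srs = osr.SpatialReference()
dst_srs.ImportFromEPSG(4326)
dst_wkt = dst_srs.ExportToWkt()

# Warp the source into the target projection as an in-memory VRT.
warped = gdal.AutoCreateWarpedVRT(src, src.GetProjection(), dst_wkt,
                                  gdal.GRA_NearestNeighbour)

# Create the output with the same band count and data type as the input.
driver = gdal.GetDriverByName("GTiff")
dst = driver.Create("reprojected.tif",
                    warped.RasterXSize, warped.RasterYSize,
                    warped.RasterCount,
                    warped.GetRasterBand(1).DataType)
dst.SetProjection(dst_wkt)
dst.SetGeoTransform(warped.GetGeoTransform())

# And yes: the bands get written one at a time, in a for loop.
for i in range(1, warped.RasterCount + 1):
    dst.GetRasterBand(i).WriteArray(warped.GetRasterBand(i).ReadAsArray())

dst.FlushCache()
```

(`gdal.Warp("reprojected.tif", src, dstSRS="EPSG:4326")` would do all of this in one call; the manual version just makes the band loop explicit.)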

One more thing that I learned, and was absolutely delighted by, was the data type. I found myself wondering what happens if different bands have different data types, and realising that the output raster should have the same data type as the input one. All these considerations helped me structure my thinking for the behemoth task that was to come.
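Inspecting the types is quick with GDAL; a small sketch (the file name is again a stand-in):

```python
from osgeo import gdal

ds = gdal.Open("input.tif")  # hypothetical file
for i in range(1, ds.RasterCount + 1):
    band = ds.GetRasterBand(i)
    # GetDataTypeName turns the type enum (e.g. gdal.GDT_Float32) into a readable name.
    print(f"band {i}: {gdal.GetDataTypeName(band.DataType)}")
```

Passing `GetRasterBand(1).DataType` into `driver.Create`, as in the earlier sketch, is what keeps the output raster on the input's data type.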

That behemoth task is - converting HDF files to a given projection system. HDF (Hierarchical Data Format) is, as the name suggests, hierarchical: a single file is a container for multiple "datasets" called SDS (Scientific Datasets), which are basically multidimensional arrays of data. Each SDS in an HDF file may have a different data type, a different size, and even a different projection. This makes it super difficult to write one function that can handle projections for all HDF files.
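To give a feel for why, here's a sketch of how GDAL exposes those SDS as subdatasets, each of which has to be opened separately. The MODIS-style file name is made up for illustration:

```python
from osgeo import gdal

hdf = gdal.Open("MOD13Q1.hdf")  # hypothetical HDF4 file

# Each SDS shows up as a (name, description) pair.
for name, desc in hdf.GetSubDatasets():
    print(desc)

# A subdataset is opened by its name string, and carries its own
# size, data type, and projection.
sds = gdal.Open(hdf.GetSubDatasets()[0][0])
print(sds.RasterXSize, sds.RasterYSize,
      gdal.GetDataTypeName(sds.GetRasterBand(1).DataType),
      sds.GetProjection()[:60])
```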

So, in my next session, I will talk about how I handled that problem.

Till then, sayonara!

Happy Coding!!!!
