Canopy Height Map#

Apr, 2025

GIS raster dataset processing

This notebook processes canopy height maps of a selected area of interest. Once the area has been defined, corresponding GeoTIFF files are downloaded. These TIF files are tree height datasets, which are then converted into GeoPandas dataframes for analysis and for exporting the results to GeoJSON or CSV files.

Below is the flowchart illustrating the process in this notebook:

Reference:

High Resolution Canopy Height Maps by WRI and Meta Global and regional Canopy Height Maps (CHM). Created using machine learning models on high-resolution worldwide Maxar satellite imagery.

Documentation: facebookresearch/HighResCanopyHeight
Contact: dataforgood@meta.com
ManagedBy: Meta: https://dataforgood.fb.com/
Meta and World Resources Institude (WRI) - 2024. High Resolution Canopy Height Maps (CHM). Source imagery for CHM © 2016 Maxar.

Hide code cell source
import os

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import geopandas as gpd
from shapely.geometry import Point
import contextily as ctx

import rasterio
from rasterio.merge import merge
from rasterio.plot import show

import s3fs  # File interface to AWS S3

Download world tiles#

The data is stored on AWS as open data, so no AWS account is needed to access it.

Hide code cell source
# 'anon=True' disables signing for public data
fs = s3fs.S3FileSystem(anon=True)

# Define resource and location in S3
bucket = "dataforgood-fb-data"
path = "forests/v1/alsgedi_global_v6_float"
filename = "tiles.geojson"

# Define local path to save the file
local_path = "tiles.geojson"

# Download the file
fs.get(f"{bucket}/{path}/{filename}", local_path)

# Read GeoJSON into a GeoDataFrame
tiles = gpd.read_file(local_path)
tiles
tile geometry
0 023013213 POLYGON ((-115.3125 33.13755, -115.3125 33.724...
1 021210020 POLYGON ((-123.04688 54.57206, -123.04688 54.9...
2 130122211 POLYGON ((115.3125 56.94497, 115.3125 57.32652...
3 121022223 POLYGON ((46.40625 55.77657, 46.40625 56.17002...
4 310111000 POLYGON ((130.07812 -0.70311, 130.07812 0, 129...
... ... ...
56140 122321030 POLYGON ((30.23438 9.1021, 30.23438 9.79568, 2...
56141 132032310 POLYGON ((106.17188 23.88584, 106.17188 24.527...
56142 300131332 POLYGON ((44.29688 -16.63619, 44.29688 -15.961...
56143 030303300 POLYGON ((-58.35938 50.28934, -58.35938 50.736...
56144 130101111 POLYGON ((123.75 66.23146, 123.75 66.51326, 12...

56145 rows × 2 columns

Hide code cell source
tiles.plot()
plt.show()
../_images/96528429c8b384901175e52433ee76adebfa115142efc5c19cd4fb4c65e02845.png

The selected area could lie in between more than one tile as it can be observed opening the tiles.geojson file in a tool like https://geojson.io/:

Import area of interest#

For example, we can import GeoJSON file from a polygon drawn in https://geojson.io/:

Hide code cell source
area = gpd.read_file("zumaia.geojson")
area
geometry
0 POLYGON ((-2.25979 43.29366, -2.25385 43.2927,...
Hide code cell source
fig, ax = plt.subplots(figsize=(6, 4))
area.plot(ax=ax)
plt.show()
../_images/6a88ae007317ea137a3a663088b73a515c403aa1531fc380ece0e6a0c9131475.png

Find intersecting tiles#

By intersecting the global tiles with the area of interest, we can identify the relevant tiles and the geometry of the area that lies within the corresponding tile.

Hide code cell source
# Change coordinate reference to Web Mercator
tiles = tiles.to_crs("EPSG:3857")
area = area.to_crs("EPSG:3857")

# Get intersection
area_tiles = gpd.overlay(area, tiles, how="intersection")
area_tiles = area_tiles[["tile", "geometry"]]
area_tiles
tile geometry
0 031333122 POLYGON ((-251160.568 5357192.823, -250938.305...

In this case only one tile intersects with the area of interest.

Download cropped tiles#

Tiles with canopy height map datasets are provided in GIS raster format (TIF files) for download. However, due to their large size, only a subset will be downloaded, cropped to match the boundaries (X, Y, maximum and minimum) of the polygons.

Hide code cell source
cropped_tiles = []

# For each tile involved
for tile in area_tiles["tile"]:

    # Define existing file name in S3
    filename = f"{tile}.tif"

    # Define S3 path
    s3_url = f"s3://{bucket}/{path}/chm/{filename}"

    # Get the GeoTIFF file
    file = fs.open(s3_url)

    # Open it as a GIS raster dataset
    with rasterio.open(file) as dataset:

        # Get the boundaries of the area inside the tile
        min_x, min_y, max_x, max_y = area_tiles[area_tiles["tile"] == tile].union_all().bounds

        # Get the window to read
        window = rasterio.windows.from_bounds(min_x, min_y, max_x, max_y, transform=dataset.transform)

        # Read the 2D map, 1 band: canopy heights
        cropped_data = dataset.read(1, window=window)
    
        # Store the relevant metadata
        metadata = dataset.profile
        metadata.update({
            "driver": "GTiff",
            "count": 1,  # Single band
            "dtype": "float32",
            "width": cropped_data.shape[1],
            "height": cropped_data.shape[0],
            "transform": dataset.window_transform(window),
        })

        # Save the dataset as a .tif file
        cropped_filename = f"cropped_{tile}.tif"
        with rasterio.open(cropped_filename, "w", **metadata) as dst:
            dst.write(cropped_data, 1)

        # Store the file name for the composition later
        cropped_tiles.append(cropped_filename)

Compose the mosaic#

After they are downloaded, the individual tile sections must be merged into a new GIS raster file corresponding to the target area.

Hide code cell source
# Open datasets from cropped tiles
datasets = []
for cropped_tile in cropped_tiles:
    datasets.append(rasterio.open(cropped_tile))

# Show them using rasterio's show function
fig, ax = plt.subplots()
show(datasets[0], ax=ax, cmap="viridis")  # Only one in this case
plt.show()
../_images/40403e57bf48e5f50b853fe556d8b57c4a6aaef5ff84a902a1d725866e531ba5.png
Hide code cell source
# Merge raster files
mosaic_image, mosaic_transform = rasterio.merge.merge(datasets)

# Set up the metadata for the output file (using the profile of the first tile)
out_meta = datasets[0].profile
out_meta.update({
    "driver": "GTiff",
    "count": 1,  # Single band output
    "transform": mosaic_transform,
    "width": mosaic_image.shape[2],  # Width is the 3rd dimension of the 3D array (bands, height, width)
    "height": mosaic_image.shape[1],  # Height is the 2nd dimension
})

# Save the mosaic to a file
output_file = "mosaic.tif"
with rasterio.open(output_file, "w", **out_meta) as dst:
    dst.write(mosaic_image[0], 1)  # Since it's single-band, we need to write only the first band

# Close all datasets manually to ensure they're not locked
for dataset in datasets:
    dataset.close()  # Explicitly close each dataset

# Optionally, clean up the temporary files
for cropped_tile in cropped_tiles:
    os.remove(cropped_tile)

# Show the result
with rasterio.open(output_file) as dst:
    show(dst, cmap="viridis")
    plt.show()
../_images/40403e57bf48e5f50b853fe556d8b57c4a6aaef5ff84a902a1d725866e531ba5.png

Obtain canopy height map#

The GIS raster file is read into a dataset to extract the canopy height map information, which is then transferred into a GeoPandas dataframe.

Hide code cell source
with rasterio.open(output_file) as dataset:
    indexes = dataset.indexes # Supposedly only 1 band -> heights
    heights = dataset.read(indexes[0]) # Read 2D array: map of the heights
    geo_xy = dataset.xy # Get handy function for spatial indexing

    # Replace zeros with NaN in 2D array
    heights[heights == 0] = np.nan

    # Get 2D array indices of non-NaN values
    indices = np.where(~np.isnan(heights))

    # Get non-NaN values: heights (in meters)
    values = heights[indices]

    # Array positions -> Spatial coordinates
    X, Y = geo_xy(indices[0], indices[1])

    # Convert to geometry points
    points = [Point(xy) for xy in zip(X, Y)]

    gdf = gpd.GeoDataFrame({"height": values}, geometry=points, crs="EPSG:3857")

gdf
height geometry
0 1.0 POINT (-249426.436 5358106.388)
1 1.0 POINT (-249425.242 5358106.388)
2 1.0 POINT (-249424.048 5358106.388)
3 2.0 POINT (-249422.853 5358106.388)
4 2.0 POINT (-249421.659 5358106.388)
... ... ...
418192 2.0 POINT (-249474.209 5356630.198)
418193 2.0 POINT (-249473.015 5356630.198)
418194 2.0 POINT (-249471.821 5356630.198)
418195 1.0 POINT (-249470.626 5356630.198)
418196 1.0 POINT (-249469.432 5356630.198)

418197 rows × 2 columns

Hide code cell source
fig, ax = plt.subplots()
legend_kwds = {"label": "Canopy Height (meters)"}
gdf.plot(ax=ax, column="height", legend=True, legend_kwds=legend_kwds)
ctx.add_basemap(ax, crs=gdf.crs)
plt.show()
../_images/eaa3637d5dc23abd84abb840c0b8723b507c3164321e66266b6c63d7e59e3332.png

Clip the original area#

Lastly, the GeoDataFrame is clipped to retain only the original area.

Hide code cell source
gdf_area = gpd.sjoin(gdf, area)
gdf_area = gdf_area.drop("index_right", axis=1)
gdf_area
height geometry
4277 1.0 POINT (-249997.325 5357990.538)
4278 1.0 POINT (-249996.131 5357990.538)
4279 1.0 POINT (-249994.937 5357990.538)
4280 1.0 POINT (-249993.742 5357990.538)
4393 1.0 POINT (-249997.325 5357989.344)
... ... ...
417920 3.0 POINT (-250883.517 5356630.198)
417921 2.0 POINT (-250882.323 5356630.198)
417922 2.0 POINT (-250881.128 5356630.198)
417923 1.0 POINT (-250879.934 5356630.198)
417924 1.0 POINT (-250878.74 5356630.198)

298185 rows × 2 columns

Hide code cell source
fig, ax = plt.subplots()
legend_kwds = {"label": "Canopy Height (meters)"}
gdf_area.plot(ax=ax, column="height", legend=True, legend_kwds=legend_kwds)
ctx.add_basemap(ax, crs=gdf.crs)
plt.show()
../_images/aea50e715900b058efd564bee5ccefc79385c5ac4cab6f40e8031230bc2acfb6.png

The GeoPandas dataframe allows us to analyze the data — for instance, we can explore the distribution of heights:

Hide code cell source
fig, ax = plt.subplots(figsize=(6, 4))
sns.histplot(gdf_area["height"], ax=ax, stat="count", discrete=True)
ax.set_xlabel("Heights (meters)", fontsize=11)
sns.despine()
plt.show()
../_images/0feb2cb1d090f1e205b8f40b90f7a08dc067e2cf1bc9b65da58e08afc7294206.png

Export (CSV, GeoJSON)#

Hide code cell source
# Define output
filename = "chm"
gdf_area = gdf_area.to_crs("EPSG:4326")

# To GeoJSON
gdf_area.to_file(f"{filename}.geojson", driver="GeoJSON")

# To CSV, with latitude, longitude in columns
gdf_area["lat"] = gdf_area["geometry"].y
gdf_area["lon"] = gdf_area["geometry"].x
gdf_area = gdf_area.drop("geometry", axis=1)
gdf_area.to_csv(f"{filename}.csv")