Climate Change#


\(^{1}\)Image credit: MysteryShot via iStock

Mar, 2023

Stats Probability

Background#

Climate is changing, we all know that. There is no doubt temperatures are rising and weather patterns are shifting. Science repeatedly confirms that.

Besides, when I was a kid, I remember winters were colder and it used to rain more.
But wait! This is now my memory talking. And I know it can be misleading. Where really colder and rainier those distant days in the seventies and eighties?

There is no doubt about the global warming, but is it also happening in my town?
I wanted to check it out by myself.

Hide code cell source
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
import my_functions as my

The data#

I headed for the website of the Spanish national weather agency AEMET, where open data can be downloaded. There is a weather station right in my town, so I selected it, but only data from 2008 onwards was available. So I chose instead the main station in the province, the meteorological observatory in Igueldo (San Sebastian).

Igeldo
\(^{1}\)Image credit: AEMET

This historical observatory started working in 1905, and the available data spans from 1928, so it holds the second longest weather data series in Spain. I also learned from the internet that measurement methods have not changed since then, and that made the data very valuable and this observatory a member of the international organization for the study of the climate change.

Hide code cell source
# Init
dates = []
temperatures = []
rainfalls = []

# Iterate through years
for year in range(1928, 2023):
    
    # Read json file
    with open(f"data/aemet/aemet-ss-{year}.json", encoding='utf-8') as file:
        data = json.load(file)

   # Iterate through months of the year
    for month in range(12):
        
        # Fetch average monthly temperature
        if 'tm_mes' in data[month]:
            temperature = float(data[month]['tm_mes'])
            temperatures.append(temperature)
        else:
            temperatures.append(np.nan)
        
        # Fetch monthly rainfall
        if 'p_mes' in data[month]:
            rainfall = float(data[month]['p_mes'])
            rainfalls.append(rainfall)
        else:
            rainfalls.append(np.nan)

        # Store date
        dates.append(str(year) + "-" + str(month + 1))
        
    
# Build pandas dataframe from stored data
aemet_ss = pd.DataFrame({"temp": temperatures,
                         "rainfall": rainfalls},
                        index=dates)
aemet_ss.index = pd.to_datetime(aemet_ss.index)
aemet_ss
temp rainfall
1928-01-01 NaN 179.8
1928-02-01 NaN 42.0
1928-03-01 NaN 184.5
1928-04-01 NaN 115.7
1928-05-01 NaN 125.8
... ... ...
2022-08-01 21.4 76.5
2022-09-01 18.6 147.6
2022-10-01 19.6 40.6
2022-11-01 13.4 317.3
2022-12-01 11.9 82.4

1140 rows × 2 columns

The data was downloaded as JSON files (one for each year: 95 files-years), and from them monthly average temperatures and monthly accumulated rainfalls were gathered.

Data validation#

Hide code cell source
# Print dataframe info
aemet_ss.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1140 entries, 1928-01-01 to 2022-12-01
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   temp      1132 non-null   float64
 1   rainfall  1140 non-null   float64
dtypes: float64(2)
memory usage: 26.7 KB

There are 8 month-temperatures missing.

Hide code cell source
# Show missing values
aemet_ss[aemet_ss["temp"].isna()]
temp rainfall
1928-01-01 NaN 179.8
1928-02-01 NaN 42.0
1928-03-01 NaN 184.5
1928-04-01 NaN 115.7
1928-05-01 NaN 125.8
1928-09-01 NaN 19.8
1936-09-01 NaN 77.0
1936-10-01 NaN 166.6

Temperature data is missing for six months in 1928, so I am going to remove that year altogether. Two months are also lacking data in 1936 (beginning of the Spanish War). I am going to assign them the mean value of the closest known month temperatures.

Hide code cell source
# Discard 1928
aemet_ss = aemet_ss[aemet_ss.index.year != 1928]

# Fill in missing 1936 months with neighbouring values
mean_value = np.mean([aemet_ss.loc['1936-08-01', 'temp'], aemet_ss.loc['1936-11-01', 'temp']])
aemet_ss = aemet_ss.fillna(mean_value)

# Print dataframe info
aemet_ss.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1128 entries, 1929-01-01 to 2022-12-01
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   temp      1128 non-null   float64
 1   rainfall  1128 non-null   float64
dtypes: float64(2)
memory usage: 58.7 KB

The data is now ready to proceed with the analysis.

Temperatures#

Let’s start exploring the temperatures!

Exploratory Data Analysis#

I will plot yearly average temperatures together with their average value of the last 25 years. This rolling window of 25 years draws a smoother curve and allows to see possible trends.

Hide code cell source
# Calculate yearly average temperatures
yearly_temp = aemet_ss.groupby(aemet_ss.index.year)["temp"].mean().to_frame()

# Define rolling window
window = 25 # Years

# Plot
fig, ax = plt.subplots(figsize=(14, 5))
yearly_temp.plot(ax=ax, marker='.', linestyle="-.", linewidth=0.5, color='purple')
yearly_temp.rolling(window=window).mean().plot(ax=ax, color='black')

ax.grid(axis="y")
ax.set_axisbelow(True)
ax.tick_params(axis='x', labelsize=14, rotation=0)
ax.set_xticks([1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000, 2010, 2020])
ax.tick_params(axis='y', labelsize=14)
ax.set_title("", size=15)
ax.set_xlabel("")
ax.set_ylabel("Temperature (ºC)", size=14)
ax.legend(["Yearly avg. temperature", f"Avg. rolling last {window} years"], fontsize=13)
sns.despine()
ax.set_xlim(1928,2022)

plt.show()
../_images/767b86c487b1e15d533372dd8c280866a87da44212397b82d53ccd55557a9857.png

Average temperatures do seem to be increasing lately according to the graph.

To evaluate the differences and proceed with the analysis, let’s classify the data into three groups:

  • Temperatures from the last 25 years (from 1998 to 2022)

  • Temperatures from the previous last 25 years (from 1973 to 1997)

  • Temperatures from the previous to the previous last 25 years (from 1948 to 1972)

Hide code cell source
# Split and convert to numpy array
temp_48_72 = yearly_temp.loc[1948:1972].to_numpy().flatten()
temp_73_97 = yearly_temp.loc[1973:1997].to_numpy().flatten()
temp_98_22 = yearly_temp.loc[1998:2022].to_numpy().flatten()

# Create dataframe
df_temp = pd.DataFrame({"temp_48_72": temp_48_72,
                        "temp_73_97": temp_73_97,
                        "temp_98_22": temp_98_22,
                       }).melt()

# Make bee swarm plot
fig, ax = plt.subplots(figsize=(7.5, 5))
sns.swarmplot(ax=ax, x="variable", y="value", data=df_temp,
              hue="variable", palette=["orange", "blue", "red"])
sns.boxplot(ax=ax, x="variable", y="value", data=df_temp,
            boxprops=dict(linewidth=1, facecolor='white', edgecolor='grey', alpha=1),
                whiskerprops=dict(linewidth=1, color='grey', alpha=1),
                medianprops=dict(linewidth=1, color="grey", alpha=1),
                capprops=dict(linewidth=1, color='grey', alpha=1))

ax.grid(axis="y")
ax.set_axisbelow(True)
ax.tick_params(axis='x', labelsize=14, rotation=0)
ax.tick_params(axis='y', labelsize=14)
ax.set_title("Annual temperatures", size=15)
ax.set_xlabel("")
ax.set_ylabel("Temperature (ºC)", size=14)
ax.legend().set_visible(False)
ax.set_xticklabels(["1948 - 1972", "1973 - 1997", "1998 - 2022"], rotation=0)
sns.despine()

plt.show()
../_images/544dec6446310f868579eafb78db887a46fd1100a206c79aecc6d6419317c592.png
Hide code cell source
fig, ax = plt.subplots(figsize=(7.5, 5))

sns.pointplot(ax=ax, x="variable", y="value", data=df_temp,
              hue="variable", palette=["orange", "blue", "red"])

ax.set_title("Mean values (+95% ci)", size=15)
ax.tick_params(axis='x', labelsize=14, rotation=0)
ax.tick_params(axis='y', labelsize=14)
ax.set_xlabel("", size=14)
ax.set_ylabel("Temperature (ºC)", size=14)
ax.legend().set_visible(False)
ax.set_xticklabels(["1948 - 1972", "1973 - 1997", "1998 - 2022"], rotation=0)
sns.despine()

plt.show()
../_images/9927212f6c2a6bceb3573847d72199126cb1aefb4026118b85941ea9e409804b.png
Hide code cell source
# Establish random numbers' seed for reproducibility
np.random.seed(111)

# Draw bootstrap replicates
bs_reps_48_72 = my.draw_bs_reps(temp_48_72, np.mean, size=10000)
bs_reps_73_97 = my.draw_bs_reps(temp_73_97, np.mean, size=10000)
bs_reps_98_22 = my.draw_bs_reps(temp_98_22, np.mean, size=10000)

# Plot
fig, ax = plt.subplots(figsize=(7.5, 5))

sns.histplot(bs_reps_48_72, ax=ax, kde=True, stat="density", element="step",
             color="orange", label="1948 - 1972")
sns.histplot(bs_reps_73_97, ax=ax, kde=True, stat="density", element="step",
             color="blue", label="1973 - 1997")
sns.histplot(bs_reps_98_22, ax=ax, kde=True, stat="density", element="step",
             color="red", label="1998 - 2022")

ax.set_title("Mean-values distributions", size=15)
ax.tick_params(axis='x', labelsize=12, rotation=0)
ax.tick_params(axis='y', labelsize=14)
ax.set_xlabel("Temperature (ºC)", size=14)
ax.legend()

sns.despine()
    
plt.show()
../_images/f9b8aebea10336b27f0bf7d324b0996b89611bd1d2069c0c33848d28b3a11456.png

Clearly, there has been a hike in annual average temperatures in the las 50 years, the difference is clear between ‘1973-1997’ and ‘1998-2022’ groups.

Hide code cell source
# Compute the difference of the sample means
mean_diff = np.mean(temp_98_22) - np.mean(temp_73_97)

print(f"Difference of sample means\nfrom 1973-1997 to 1998-2022 -> +{mean_diff:.2f} ºC")
Difference of sample means
from 1973-1997 to 1998-2022 -> +0.75 ºC

How significant is this increment?
Are temperatures really on the rise, or were this higher temperatures the result of just chance?

We are going to do hypothesis testing to evaluate it.

Hypothesis testing#

The null hypothesis takes the stand of the climate change denial: that is, the measures from ‘1973-1997’ and ‘1998-2022’, they both come from the same data distribution because climate has not changed, and so the difference we obtained in temperatures occurred just by chance.

We will do a permutation test to check if this could be the case. We will scramble the data from both sets in what is known as permutation samples to simulate this belonging to the same source.

Let’s first explore visually in a ECDF (Empirical Cumulative Distribution Function) how permutation samples relate to the original data sets.

Hide code cell source
# Plot
fig, ax = plt.subplots(figsize=(7.5, 5))

# Number of permutation samples to draw
n_samples = 100

# Iterate
for _ in range(n_samples):
    # Generate permutation samples
    perm_sample_1, perm_sample_2 = my.permutation_sample(temp_73_97, temp_98_22)

    # Compute ECDFs
    x_1, y_1 = my.ecdf(perm_sample_1)
    x_2, y_2 = my.ecdf(perm_sample_2)

    # Plot ECDFs of permutation sample
    ax.plot(x_1, y_1, marker='.', linestyle='none', color='blue', alpha=0.02)
    ax.plot(x_2, y_2, marker='.', linestyle='none', color='red', alpha=0.02)


# Now create and plot ECDFs from original data
x_1, y_1 = my.ecdf(temp_73_97)
x_2, y_2 = my.ecdf(temp_98_22)
ax.plot(x_1, y_1, marker='.', linestyle='none', color='blue', label="1973-1997")
ax.plot(x_2, y_2, marker='.', linestyle='none', color='red', label="1998-2022")

ax.set_axisbelow(True)
ax.tick_params(axis='x', labelsize=14, rotation=0)
ax.tick_params(axis='y', labelsize=14)
ax.set_title("Annual mean temperatures", size=15)
ax.set_xlabel("Temperature (ºC)", size=14)
ax.set_ylabel("ECDF", size=14)
ax.legend(loc='lower right', fontsize=13)
sns.despine()

plt.show()
../_images/a89e05f053dbdbd1028265e5ad073bbb2faee5ae43fb1fe732eb1f9840a5aa60.png

We see that permutated data sets (in hazed purple points) do not overlap with the original temperature sets, suggesting there is a significant difference among them. Let’s prove it!

To test this hypothesis we will use the difference of means as the test statistic..

Hide code cell source
def diff_of_means(data_1, data_2):
    """Difference in means of two arrays."""

    # The difference of means of data_1, data_2
    diff = np.mean(data_1) - np.mean(data_2)

    return diff

We will compute the probability of getting a difference in mean temperatures of 0.75 ºC or more, under the null hypothesis that the distributions of the two groups (‘1973-1997’ and ‘1998-2022’) are identical.

Hide code cell source
# Establish random numbers' seed for reproducibility
np.random.seed(111)

# Draw 10,000 permutation replicates
perm_replicates = my.draw_perm_reps(temp_98_22, temp_73_97, diff_of_means, size=10000)

# Compute p-value
p = np.sum(perm_replicates >= mean_diff) / len(perm_replicates)

print(f"p-value = {p}")
p-value = 0.0001

The p-value says that after randomly scrambling the data from the two groups 10000 times and making two groups each time, only once the difference between mean values of each group was equal or higher than the difference among the original groups. That is to say, the probability of observing this difference in mean temperatures if there was not such a thing as two different groups would be just of 0.01 %.

Graphically looks this way.

Hide code cell source
# Plot
fig, ax = plt.subplots(figsize=(7.5, 5))

sns.histplot(perm_replicates, ax=ax, kde=True, stat="density", element="step",
             color="purple")

ax.set_title("Mean-difference distribution\nfrom permutations samples of 1973-1997 and 1998-2022", size=15)
ax.tick_params(axis='x', labelsize=12, rotation=0)
ax.tick_params(axis='y', labelsize=14)
ax.set_xlabel("Temperature (ºC)", size=14)
ax.axvline(mean_diff, linestyle="--", color="black", linewidth=1)
ax.annotate("Prob. temp. diff.\n0.75ºC or more",
            xy=(0.8, 0.01), 
            xytext=(0.8, 0.25),
            arrowprops={"arrowstyle":"-|>", "color":"black"})
sns.despine()
    
plt.show()
../_images/eea1522b69e00a83a66d9168a1ef2ecf1257f17d25ce0a65efad907a1f26d099.png

Extremely low probability, so we reject the null hypothesis and conclude that the two temperature groups must be indeed different and come with different distributions, i.e. temperatures are rising.

Rainfall#

Was it rainier when I was a kid? That is what I thought. But my memories are probably skewed, maybe in part because of the awareness about the ongoing climate change, which suggests drier weather conditions nowadays.

Let’s check it out!

Exploratory Data Analysis#

Hide code cell source
# Calculate yearly average rainfall
yearly_rainfall = aemet_ss.groupby(aemet_ss.index.year)["rainfall"].sum().to_frame()
yearly_rainfall_avg10 = yearly_rainfall.rolling(window=window).mean().reset_index(drop=True)

# Plot
fig, ax = plt.subplots(figsize=(14, 5))

yearly_rainfall.plot(kind="bar", color='lightseagreen', ax=ax)
yearly_rainfall_avg10.plot(kind="line", color='black', ax=ax)

ax.grid(axis="y")
ax.set_axisbelow(True)
ax.tick_params(axis='x', labelsize=9, rotation=90)
ax.tick_params(axis='y', labelsize=14)
ax.set_title("", size=15)
ax.set_xlabel("")
ax.set_ylabel("Rainfall (mm)", size=14)
ax.legend([f"Avg. rolling last {window} years", "Yearly rainfall"],
          loc='upper left', fontsize=13)
sns.despine()

plt.show()
../_images/8055b57e1a1cf9e77dd363fd1318f737da3d523db33dcb007f94c5831fc62049.png

Yikes, it looks like average rainfall has not changed during all these years!

Hide code cell source
# Split and convert to numpy array
rain_48_72 = yearly_rainfall.loc[1948:1972].to_numpy().flatten()
rain_73_97 = yearly_rainfall.loc[1973:1997].to_numpy().flatten()
rain_98_22 = yearly_rainfall.loc[1998:2022].to_numpy().flatten()

# Create dataframe
df_rain = pd.DataFrame({"rain_48_72": rain_48_72,
                        "rain_73_97": rain_73_97,
                        "rain_98_22": rain_98_22,
                       }).melt()

# Make bee swarm plot
fig, ax = plt.subplots(figsize=(7.5, 5))
sns.swarmplot(ax=ax, x="variable", y="value", data=df_rain,
              hue="variable", palette=["grey", "orange", "green"])
sns.boxplot(ax=ax, x="variable", y="value", data=df_rain,
            boxprops=dict(linewidth=1, facecolor='white', edgecolor='grey', alpha=1),
                whiskerprops=dict(linewidth=1, color='grey', alpha=1),
                medianprops=dict(linewidth=1, color="grey", alpha=1),
                capprops=dict(linewidth=1, color='grey', alpha=1))

ax.grid(axis="y")
ax.set_axisbelow(True)
ax.tick_params(axis='x', labelsize=14, rotation=0)
ax.tick_params(axis='y', labelsize=14)
ax.set_title("Yearly rainfall", size=15)
ax.set_xlabel("")
ax.set_ylabel("Rainfall (mm)", size=14)
ax.legend().set_visible(False)
ax.set_xticklabels(["1948 to 1972", "1973 to 1997", "1998 to 2022"], rotation=0)
sns.despine()

plt.show()
../_images/51f15894f321b1ee6b9b10a91a19dbd27b1d865c9cbabb3df94930551f0e95b4.png

If we make three groups like in the case of the temperatures, we see very similar measure distributions for the rainfalls.

Hide code cell source
fig, ax = plt.subplots(figsize=(7.5, 5))

sns.pointplot(ax=ax, x="variable", y="value", data=df_rain,
              hue="variable", palette=["grey", "orange", "green"])

ax.set_title("Mean values (95% ci)", size=15)
ax.tick_params(axis='x', labelsize=14, rotation=0)
ax.tick_params(axis='y', labelsize=14)
ax.set_xlabel("", size=14)
ax.set_ylabel("Rainfall (mm)", size=14)
ax.legend().set_visible(False)
ax.set_xticklabels(["1948 - 1972", "1973 - 1997", "1998 - 2022"], rotation=0)
sns.despine()

plt.show()
../_images/d04d707df35e6c2107951828eafa16d0334c63d268ae97fe40303dd2e1d026e6.png

Mean values are quite similar, with confidence intervals overlapping.

Hide code cell source
# Draw bootstrap replicates
bs_reps_48_72 = my.draw_bs_reps(rain_48_72, np.mean, size=10000)
bs_reps_73_97 = my.draw_bs_reps(rain_73_97, np.mean, size=10000)
bs_reps_98_22 = my.draw_bs_reps(rain_98_22, np.mean, size=10000)

# Plot
fig, ax = plt.subplots(figsize=(7.5, 5))

sns.histplot(bs_reps_48_72, ax=ax, kde=True, stat="density", element="step",
             color="grey", label="1948 - 1972")
sns.histplot(bs_reps_73_97, ax=ax, kde=True, stat="density", element="step",
             color="orange", label="1973 - 1997")
sns.histplot(bs_reps_98_22, ax=ax, kde=True, stat="density", element="step",
             color="green", label="1998 - 2022")

ax.set_title("Mean-values distributions", size=15)
ax.tick_params(axis='x', labelsize=12, rotation=0)
ax.tick_params(axis='y', labelsize=14)
ax.set_xlabel("Rainfall (mm)", size=14)
ax.legend()

sns.despine()
    
plt.show()
../_images/ccb80e448394809c5edd126656e08e1c4d041da1821bd6005b02c82582272358.png

They all overlap and look similar. It does not seem there has been any change during the years.

Hide code cell source
# Compute difference of mean
empirical_diff_means = np.mean(rain_98_22) - np.mean(rain_73_97)
print(f"Empirical diff. in mean\nfrom 1973-1997 to 1998-2022 -> {empirical_diff_means:.2f} mm")
Empirical diff. in mean
from 1973-1997 to 1998-2022 -> -8.69 mm

The difference in rainfalls looks insignificant.

Hide code cell source
# Plot
fig, ax = plt.subplots(figsize=(7.5, 5))

# Number of permutation samples to draw
n_samples = 100

# Iterate
for _ in range(n_samples):
    # Generate permutation samples
    perm_sample_1, perm_sample_2 = my.permutation_sample(rain_73_97, rain_98_22)

    # Compute ECDFs
    x_1, y_1 = my.ecdf(perm_sample_1)
    x_2, y_2 = my.ecdf(perm_sample_2)

    # Plot ECDFs of permutation sample
    ax.plot(x_1, y_1, marker='.', linestyle='none', color='lightseagreen', alpha=0.02)
    ax.plot(x_2, y_2, marker='.', linestyle='none', color='lightseagreen', alpha=0.02)

    
# Now create and plot ECDFs from original data
x_1, y_1 = my.ecdf(rain_73_97)
x_2, y_2 = my.ecdf(rain_98_22)
ax.plot(x_1, y_1, marker='.', linestyle='none', color='orange', label='1973-1997')
ax.plot(x_2, y_2, marker='.', linestyle='none', color='green', label='1998-2022')

ax.set_axisbelow(True)
ax.tick_params(axis='x', labelsize=14, rotation=0)
ax.tick_params(axis='y', labelsize=14)
ax.set_title("", size=15)
ax.set_xlabel("Yearly rainfall (mm)", size=14)
ax.set_ylabel("ECDF", size=14)
ax.legend(loc='lower right', fontsize=13)
sns.despine()

plt.show()
../_images/6c8e7231a940536a7757188e6a5659a25e670da5f722988544877dff060da2b3.png

We see that permutated data sets (in hazed points) do overlap with the original rainfall sets (orange and green), suggesting there is not a significant difference. I conclude that the two rainfall groups are indeed the same and come with the same distributions, i.e. there is no difference in rainfalls (so my memory was in fact skewed).

Conclusion#

In this little project permutational testing was used to assess whether temperatures and rainfalls are changing locally because of the climate change. I came to the conclusion that temperatures are certainly rising, and rainfalls remain the same.

One of the best things about doing Data Science is the thrill you get from collecting raw data and drawing your own conclusions. Even more if the subject is such an important and universal topic as Climate Change. You are doing Science, probably not high-level science, but science anyway!