Analyzing global internet patterns#

A DataCamp challenge · January 2025

k-means clustering

Executive Summary#

This analysis categorizes countries into three distinct groups based on the rate and pattern of internet adoption between 2000 and 2022:

  1. Early Adopters: Countries in this group experienced rapid internet usage growth, particularly in the early years, surpassing 60% by 2007 and exceeding 90% by 2022. The steep initial growth reflects early investment in internet infrastructure, followed by a plateau in recent years, indicating market saturation.

  2. Mid Adopters: These countries showed a more gradual adoption curve, reaching 40% internet usage by 2012 and steadily growing. By 2022, nearly 80% of the population was online. However, a visible slowdown suggests they may not reach the adoption levels of early adopters, potentially plateauing below 90%.

  3. Late Adopters: These countries saw much slower internet growth initially, but adoption accelerated after 2010. By 2022, internet usage reached approximately 40%, indicating continued growth. However, a recent tendency toward flattening at this lower adoption rate (~40%) suggests the curve may plateau prematurely.

In conclusion, early adopters are approaching full internet penetration, while mid and late adopters are showing signs of slower growth before reaching higher usage levels. Continued investment in infrastructure may be necessary to support further growth, particularly for late adopters.

The project#

The dataset tracks internet usage for different countries from 2000 to 2023. The goal is to import, clean, analyze, and visualize the data to understand how internet usage has changed over time and to identify the countries still widely affected by a lack of internet availability.

The data#

| Column name | Description |
| --- | --- |
| Country Name | Name of the country |
| Country Code | The country's 3-character code |
| 2000 | % of the population using the internet in 2000 |
| 2001 | % of the population using the internet in 2001 |
| 2002 | % of the population using the internet in 2002 |
| 2003 | % of the population using the internet in 2003 |
| … | … |
| 2023 | % of the population using the internet in 2023 |

Data validation#

# Import packages
import geopandas as gpd
import matplotlib.pyplot as plt
import missingno as msno
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from sklearn.cluster import KMeans

# Read the data
df = pd.read_csv("data/internet_usage.csv")
df
Country Name Country Code 2000 2001 2002 2003 2004 2005 2006 2007 ... 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023
0 Afghanistan AFG .. 0.00472257 0.0045614 0.0878913 0.105809 1.22415 2.10712 1.9 ... 7 8.26 11 13.5 16.8 17.6 18.4 .. .. ..
1 Albania ALB 0.114097 0.325798 0.390081 0.9719 2.42039 6.04389 9.60999 15.0361 ... 54.3 56.9 59.6 62.4 65.4 68.5504 72.2377 79.3237 82.6137 83.1356
2 Algeria DZA 0.491706 0.646114 1.59164 2.19536 4.63448 5.84394 7.37598 9.45119 ... 29.5 38.2 42.9455 47.6911 49.0385 58.9776 60.6534 66.2356 71.2432 ..
3 American Samoa ASM .. .. .. .. .. .. .. .. ... .. .. .. .. .. .. .. .. .. ..
4 Andorra AND 10.5388 .. 11.2605 13.5464 26.838 37.6058 48.9368 70.87 ... 86.1 87.9 89.7 91.5675 .. 90.7187 93.2056 93.8975 94.4855 ..
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
212 Virgin Islands (U.S.) VIR 13.8151 18.3758 27.4944 27.4291 27.377 27.3443 27.3326 27.3393 ... 50.07 54.8391 59.6083 64.3775 .. .. .. .. .. ..
213 West Bank and Gaza PSE 1.11131 1.83685 3.10009 4.13062 4.4009 16.005 18.41 21.176 ... 53.6652 56.7 59.9 63.3 64.4 70.6226 76.01 81.83 88.6469 86.6377
214 Yemen, Rep. YEM 0.0825004 0.0908025 0.518796 0.604734 0.881223 1.0486 1.24782 5.01 ... 22.55 24.0854 24.5792 26.7184 .. .. 13.8152 14.8881 17.6948 ..
215 Zambia ZMB 0.191072 0.23313 0.477751 0.980483 1.1 1.3 1.6 1.9 ... 6.5 8.8 10.3 12.2 14.3 18.7 24.4992 26.9505 31.2342 ..
216 Zimbabwe ZWE 0.401434 0.799846 1.1 1.8 2.1 2.4 2.4 3 ... 16.3647 22.7428 23.12 24.4 25 26.5883 29.2986 32.4616 32.5615 ..

217 rows × 26 columns

Missing data is marked as “..”. I will replace “..” with NaN values so the blank cells can be mapped out with the help of the missingno library.

# Replace ".." with missing values
df = df.replace({"..": np.nan})

# Plot missing values map
msno.matrix(df)
plt.show()
[Figure: missingno matrix of missing values per country across the year columns]

Data for many countries is missing for the last recorded year, 2023. Even if I interpolated it from the previous year, the trends would flatten, which could distort the picture, so I will drop the 2023 column.

# Drop 2023 year
df = df.drop("2023", axis=1)

Convert the numerical columns to float (they were read in as strings because of the “..” placeholders).

# Convert to float type numerical columns only
df[df.columns[2:]] = df[df.columns[2:]].astype("float")

Check for duplicates:

# Check on all columns
print(f"Duplicated rows in whole table -> {df.duplicated().sum()}")

# Check on numeric columns
print(
    f"Duplicated rows in the numeric table -> {df.duplicated(subset=df.columns[2:]).sum()}\n"
)

# Show duplicated
df.loc[df.duplicated(subset=df.columns[2:]), :]
Duplicated rows in whole table -> 0
Duplicated rows in the numeric table -> 6
Country Name Country Code 2000 2001 2002 2003 2004 2005 2006 2007 ... 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
39 Channel Islands CHI NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
94 Isle of Man IMN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
146 Northern Mariana Islands MNP NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
172 Sint Maarten (Dutch part) SXM NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
183 St. Martin (French part) MAF NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
200 Turks and Caicos Islands TCA NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

6 rows × 25 columns

The duplicated rows are countries with no data at all in the numeric columns: pandas treats NaN values as equal when testing for duplicates, so the all-NaN rows match one another. I will drop these rows; note that drop_duplicates keeps the first occurrence (American Samoa), which will be removed later along with the other rows that still contain missing values.

# Drop duplicated all-NaN rows (the first occurrence is kept and handled later)
df = df.drop_duplicates(subset=df.columns[2:])

To further eliminate missing values, I will linearly interpolate the gaps across columns (years).

Interior gaps get true linear interpolation; runs of NaNs at the start or end of a series are instead filled both forward and backward with the nearest valid value, up to a limited number of consecutive NaNs.
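To make the behavior of the limit and limit_direction="both" parameters concrete, here is a minimal sketch on a toy series (my illustration, not part of the original pipeline):

# Toy series: three leading NaNs, one interior NaN, two trailing NaNs
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, np.nan, 10.0, np.nan, 30.0, np.nan, np.nan])
print(s.interpolate(limit=2, limit_direction="both").tolist())
# -> [nan, 10.0, 10.0, 10.0, 20.0, 30.0, 30.0, 30.0]
# The interior NaN is linearly interpolated; edge runs are padded with the
# nearest valid value, and at most `limit` NaNs per direction are filled,
# so the outermost leading NaN survives.

But how do we determine the maximum number of consecutive NaNs to fill? Let’s analyze how many countries with remaining NaNs are removed based on the limit we choose.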

# Init lists
n_countries_with_nan = []
n_limit = []

# Iterate through a limit range
for limit in range(1, 10):
    # Interpolate
    interpolated = df.loc[:, df.columns[2:]].interpolate(
        limit=limit, limit_direction="both", axis=1
    )

    # Build dataframe with interpolated values
    dfi = df.loc[:, ["Country Name", "Country Code"]].join(interpolated)

    # Get number of missing values per row
    nans = dfi.isna().sum(axis=1)

    # Show countries with still missing values
    df_nans = dfi.join(nans.to_frame(name="nans"))
    df_nans = df_nans.loc[df_nans["nans"] != 0, ["Country Name", "nans"]]

    # Store the number of countries still with missing values
    n_countries_with_nan.append(len(df_nans))
    n_limit.append(limit)

# Plot
limit_df = pd.DataFrame(
    {"n_limit": n_limit, "n_countries_with_nan": n_countries_with_nan}
)
fig, ax = plt.subplots(figsize=(4, 3))
sns.pointplot(x="n_limit", y="n_countries_with_nan", data=limit_df, ax=ax)
ax.set_xlabel("Year-limit of consecutive NaN filling", size=10)
ax.set_ylabel("Number of countries still with NaNs", size=10)
sns.despine()
plt.show()
[Figure: number of countries still containing NaNs vs. the year-limit of consecutive NaN filling]

By limiting the filling of missing values to 5 consecutive years, we significantly reduce the number of countries that still have gaps. Carrying values across 5 consecutive years is a strong assumption, but since I want to include the maximum number of countries in the study, I will use this as the cutoff criterion.

# Define limit years to fill NaNs
limit = 5

# Interpolate
interpolated = df.loc[:, df.columns[2:]].interpolate(
    limit=limit, limit_direction="both", axis=1
)

# Build dataframe with interpolated values
dfi = df.loc[:, ["Country Name", "Country Code"]].join(interpolated)

# Get number of missing values per row
nans = dfi.isna().sum(axis=1)

# Show countries with still missing values
df_nans = dfi.join(nans.to_frame(name="nans"))
df_nans = df_nans.loc[df_nans["nans"] != 0, ["Country Name", "nans"]]
df_nans
Country Name nans
3 American Samoa 23
50 Curacao 11
75 Gibraltar 1
103 Korea, Dem. People's Rep. 4
105 Kosovo 12
150 Palau 13
178 South Sudan 8

These are the countries that still have NaNs. Given their larger populations relative to the others on the list, I will take a closer look at South Sudan and North Korea to understand what’s going on with them.

# Show South Sudan
dfi.loc[dfi["Country Name"] == "South Sudan", :]
Country Name Country Code 2000 2001 2002 2003 2004 2005 2006 2007 ... 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
178 South Sudan SSD NaN NaN NaN NaN NaN NaN NaN NaN ... 2.2 2.6 3.0 3.5 4.1 4.8 6.7 9.27123 9.64241 12.1407

1 rows × 25 columns

South Sudan has no early records because it only became an independent country in 2011, which explains the NaNs. Since the existing values are very low, I will fill in the NaNs with 0 to include this country in the analysis.

# Fill NaN with 0 for South Sudan
dfi.loc[dfi["Country Name"] == "South Sudan", :] = dfi.loc[
    dfi["Country Name"] == "South Sudan", :
].fillna(0)

# Show North Korea
dfi.loc[dfi["Country Name"] == "Korea, Dem. People's Rep.", :]
Country Name Country Code 2000 2001 2002 2003 2004 2005 2006 2007 ... 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
103 Korea, Dem. People's Rep. PRK 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN

1 rows × 25 columns

Here the missing values are just a continuation of null values: every recorded year is 0. Since there is no information at all, I will drop North Korea from the analysis, together with the rest of the remaining smaller countries.

# Get indexes of rows with missing values
nan_rows = dfi.index[dfi.isna().sum(axis=1) != 0]

# Drop rows with missing values
dfi = dfi.drop(nan_rows)

# Keep the non-interpolated df for plotting series with their original gaps
df = df.drop(nan_rows)

# Get df info
dfi.info()
<class 'pandas.core.frame.DataFrame'>
Index: 205 entries, 0 to 216
Data columns (total 25 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country Name  205 non-null    object 
 1   Country Code  205 non-null    object 
 2   2000          205 non-null    float64
 3   2001          205 non-null    float64
 4   2002          205 non-null    float64
 5   2003          205 non-null    float64
 6   2004          205 non-null    float64
 7   2005          205 non-null    float64
 8   2006          205 non-null    float64
 9   2007          205 non-null    float64
 10  2008          205 non-null    float64
 11  2009          205 non-null    float64
 12  2010          205 non-null    float64
 13  2011          205 non-null    float64
 14  2012          205 non-null    float64
 15  2013          205 non-null    float64
 16  2014          205 non-null    float64
 17  2015          205 non-null    float64
 18  2016          205 non-null    float64
 19  2017          205 non-null    float64
 20  2018          205 non-null    float64
 21  2019          205 non-null    float64
 22  2020          205 non-null    float64
 23  2021          205 non-null    float64
 24  2022          205 non-null    float64
dtypes: float64(23), object(2)
memory usage: 41.6+ KB

Cluster analysis#

To identify patterns in the evolution of internet usage across countries, I will cluster them into groups using the k-means algorithm. Each country is represented by its 23 yearly values (2000 to 2022), and since all features share the same 0–100 percentage scale, no standardization is needed beforehand.
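For reference, the elbow search below relies on the model’s inertia, which scikit-learn defines as the sum of squared distances of each sample to its closest cluster center:

$$\text{inertia} = \sum_{i=1}^{n} \min_{\mu_j \in C} \lVert x_i - \mu_j \rVert^2$$

Here each $x_i$ is one country’s usage vector and the $\mu_j$ are the cluster centroids; the “elbow” is the point where adding more clusters stops reducing inertia substantially.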

# Subset samples from dataframe
samples = dfi.loc[:, dfi.columns[2:]]

# Search for number of clusters
inertias = []
clusters = []
for n_clusters in range(1, 10):
    model = KMeans(n_clusters=n_clusters, random_state=42)  # fixed seed for reproducibility
    model.fit(samples)
    inertias.append(model.inertia_)
    clusters.append(n_clusters)

# Plot
inertias_df = pd.DataFrame({"n_clusters": clusters, "inertia": inertias})
fig, ax = plt.subplots(figsize=(4, 3))
sns.pointplot(x="n_clusters", y="inertia", data=inertias_df, ax=ax)
sns.despine()
plt.show()
[Figure: elbow plot of inertia vs. number of clusters]

According to the Elbow Method, I choose “3” as the optimal number of clusters.
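As an extra sanity check (my addition, not part of the original challenge), the silhouette score, which is higher when clusters are dense and well separated, can corroborate this choice:

from sklearn.metrics import silhouette_score

# Mean silhouette score for each candidate number of clusters
for k in range(2, 6):
    labels_k = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(samples)
    print(f"k={k}: silhouette = {silhouette_score(samples, labels_k):.3f}")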

Therefore, samples will be labeled as either ‘0’, ‘1’, or ‘2’. Once labeled, I will plot the mean values for each group.

# Instantiate model with defined n_clusters
model = KMeans(n_clusters=3, random_state=42)

# Fit the model
model.fit(samples)

# Predict labels
labels = model.predict(samples)

# Insert labels in a new column of the dataframe
dfi["label"] = labels
df["label"] = labels

# Subset by cluster label and calculate average values per year,
# building one column per cluster in a single step
df_clusters = (
    dfi.groupby("label")[dfi.select_dtypes(include="float").columns]
    .mean()
    .T.rename(columns=lambda label: f"Cluster {label}")
    .reset_index(names="Year")
)

# Transform dataframe from wide to long format for plotting
df_clusters_l = df_clusters.melt(
    id_vars="Year", var_name="Cluster", value_name="Internet Usage"
)

# Plot
fig, ax = plt.subplots(figsize=(6, 4))
sns.lineplot(
    ax=ax,
    x="Year",
    y="Internet Usage",
    data=df_clusters_l,
    hue="Cluster",
    errorbar=None,
)
ax.tick_params(axis="x", labelsize=10, rotation=90)
ax.grid(axis="y")
ax.set_axisbelow(True)
ax.set_xlabel("", fontsize=11)
ax.set_ylabel("Internet Usage %", fontsize=12)
ax.set_ylim(0, 100)
sns.despine()
plt.show()
[Figure: mean internet usage per year for Cluster 0, Cluster 1, and Cluster 2]

The graph shows that the clusters identified by the algorithm correspond to groups of countries with different internet adoption rates and patterns:

  • Cluster 0 -> I name it “late adopters”

  • Cluster 1 -> I name it “early adopters”

  • Cluster 2 -> I name it “mid adopters”

Now that we have identified the clusters, I will replace the labels for clusters ‘0’, ‘1’, and ‘2’ with their corresponding meaningful category names and plot the results using interactive graphs, allowing users to select countries to display alongside the group trends.

# Define categories
categories = ["early", "mid", "late"]

# Replace label number with category name
dfi["label"] = dfi["label"].replace(
    {0: categories[2], 1: categories[0], 2: categories[1]}
)
df["label"] = df["label"].replace(
    {0: categories[2], 1: categories[0], 2: categories[1]}
)

# Replace cluster label with category name
df_clusters_l["Cluster"] = df_clusters_l["Cluster"].replace(
    {"Cluster 0": categories[2], "Cluster 1": categories[0], "Cluster 2": categories[1]}
)
# Rename column name
dfi = dfi.rename(columns={"label": "Adopter"})
df = df.rename(columns={"label": "Adopter"})
df_clusters_l = df_clusters_l.rename(columns={"Cluster": "Adopter"})

# Convert the column to ordered categorical
dfi["Adopter"] = dfi["Adopter"].astype(
    pd.CategoricalDtype(categories=categories, ordered=True)
)
df["Adopter"] = df["Adopter"].astype(
    pd.CategoricalDtype(categories=categories, ordered=True)
)
df_clusters_l["Adopter"] = df_clusters_l["Adopter"].astype(
    pd.CategoricalDtype(categories=categories, ordered=True)
)

# Convert dataframes from wide to long format for plotting
dfi_long = dfi.melt(
    id_vars=["Country Name", "Country Code", "Adopter"],
    var_name="Year",
    value_name="Internet Usage",
)
df_long = df.melt(
    id_vars=["Country Name", "Country Code", "Adopter"],
    var_name="Year",
    value_name="Internet Usage",
)

# Define colors to match the ones that appeared in the clusters plotting
sns_tab10_blue = sns.color_palette("tab10").as_hex()[0]
sns_tab10_orange = sns.color_palette("tab10").as_hex()[1]
sns_tab10_green = sns.color_palette("tab10").as_hex()[2]

# Instance of plotly graphic object
fig = go.Figure()

# Add traces for each adopter type (cluster)
color_key = {"early": sns_tab10_orange, "mid": sns_tab10_green, "late": sns_tab10_blue}
for adopter in categories:  # ["early", "mid", "late"]
    df_adopter = df_clusters_l.loc[df_clusters_l["Adopter"] == adopter, :]
    fig.add_trace(
        go.Scatter(
            x=df_adopter["Year"],
            y=df_adopter["Internet Usage"],
            mode="lines",
            name=f"{adopter} adopters",
            line=dict(color=color_key[adopter], dash="dot"),
            hovertemplate="%{fullData.name}<br> %{x:.0f}: %{y:.0f}%<extra></extra>",
        )
    )


# Add traces for each country
for country, adopter in zip(dfi["Country Name"], dfi["Adopter"]):
    df_country = df_long.loc[df_long["Country Name"] == country, :]
    fig.add_trace(
        go.Scatter(
            x=df_country["Year"],
            y=df_country["Internet Usage"],
            mode="lines",
            name=country,
            visible="legendonly",  # Hidden when first shown
            line=dict(color=color_key[adopter]),
            hovertemplate="%{fullData.name}<br> %{x:.0f}: %{y:.0f}%<extra></extra>",
        )
    )

# Layout parameters
fig.update_layout(
    # # Set width and height (in pixels)
    # width=1200,
    # height=600,
    # Set the range for the y-axis
    yaxis=dict(range=[0, 100]),
    # Set titles
    title="Internet Usage in the World",
    yaxis_title="% of the population",
)

fig.show()

# Plot world map choropleth
# Read into a geopandas dataframe countries' map and data
# Data downloaded from https://www.naturalearthdata.com/downloads/110m-cultural-vectors/
gdf = gpd.read_file("map/ne_110m_admin_0_countries.shp")

# Select some columns of interest
gdf = gdf.loc[:, ["NAME", "GU_A3", "CONTINENT", "POP_EST", "ECONOMY", "geometry"]]

# Change South Sudan's code to the one used in the dataset; otherwise it will not merge and appear on the map
gdf.loc[gdf["NAME"] == "S. Sudan", "GU_A3"] = "SSD"

# Merge to the left on geopandas dataframe
gdf_labels = gdf.merge(
    dfi.loc[:, ["Country Name", "Country Code", "Adopter"]],
    how="left",
    left_on="GU_A3",
    right_on="Country Code",
)

# Drop missing values (not matched after merging)
gdf_labels = gdf_labels.dropna()

# Create "id" column for plotting
gdf_labels["id"] = gdf_labels.index.astype(str)

# World map choropleth
fig = px.choropleth(
    gdf_labels,
    geojson=gdf_labels.__geo_interface__,
    locations="id",
    color="Adopter",
    hover_name="NAME",
    projection="natural earth",
    color_discrete_map={
        "early": sns_tab10_orange,
        "mid": sns_tab10_green,
        "late": sns_tab10_blue,
    },
    category_orders={  # Ensure the legend shows the ordered categories
        "Adopter": ["early", "mid", "late"]
    },
    hover_data={
        "id": False,  # Exclude the 'id' from hover
        "Adopter": True,  # Include 'name' (country names)
    },
)

# fig.update_layout(
#     width=800,
#     height=600,
# )

fig # Apparently, "fig.show()" does not render in jupyter-book!

Conclusions#

Probably the most important conclusion of this study is that the rate of growth in internet adoption is slowing down. While in early-adopter and even mid-adopter countries this seems to be due to natural saturation, the slowdown is particularly striking in late-adopter countries, where the proportion of the population using the internet is still considerably low. This could serve as a warning, indicating the need to invest in infrastructure and resources in these nations.
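To put a rough number on this slowdown, here is a minimal sketch (my addition, reusing the df_clusters_l dataframe built above) that estimates each group’s average yearly gain, in percentage points, between 2017 and 2022:

# Average yearly gain per adopter group over the last recorded years
recent = df_clusters_l[df_clusters_l["Year"].astype(int) >= 2017].sort_values("Year")
avg_gain = recent.groupby("Adopter", observed=True)["Internet Usage"].agg(
    lambda s: (s.iloc[-1] - s.iloc[0]) / (len(s) - 1)
)
print(avg_gain)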