Hair Loss Analysis#

A DataCamp challenge · November 2024

Data Analysis · Machine Learning

The project#

As we age, hair loss becomes a health concern for many people. Fullness of hair not only affects appearance but is also closely related to an individual’s health.

A survey brings together a variety of factors that may contribute to hair loss, including genetic factors, hormonal changes, medical conditions, medications, nutritional deficiencies, psychological stress, and more. Through data exploration and analysis, the potential correlation between these factors and hair loss can be explored in depth, providing a useful reference for individual health management, medical intervention, and related industries.

The data#

The data contains information on the people surveyed. Each row represents one person.

  • Id - A unique identifier for each person.

  • Genetics - Whether the person has a family history of baldness.

  • Hormonal Changes - Indicates whether the individual has experienced hormonal changes (Yes/No).

  • Medical Conditions - Medical history that may lead to baldness: alopecia areata, thyroid problems, scalp infections, psoriasis, dermatitis, etc.

  • Medications & Treatments - History of medications that may cause hair loss: chemotherapy, heart medications, antidepressants, steroids, etc.

  • Nutritional Deficiencies - Lists nutritional deficiencies that may contribute to hair loss, such as iron deficiency, vitamin D deficiency, biotin deficiency, omega-3 fatty acid deficiency, etc.

  • Stress - Indicates the stress level of the individual (Low/Moderate/High).

  • Age - Represents the age of the individual.

  • Poor Hair Care Habits - Indicates whether the individual practices poor hair care habits (Yes/No).

  • Environmental Factors - Indicates whether the individual is exposed to environmental factors that may contribute to hair loss (Yes/No).

  • Smoking - Indicates whether the individual smokes (Yes/No).

  • Weight Loss - Indicates whether the individual has experienced significant weight loss (Yes/No).

  • Hair Loss - Binary variable indicating the presence (1) or absence (0) of baldness in the individual.

Data validation#

Read the data#

# Import packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import (
    GridSearchCV,
    KFold,
    cross_val_score,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Read the data from file
hairloss = pd.read_csv("data/Predict Hair Fall.csv")
hairloss
| | Id | Genetics | Hormonal Changes | Medical Conditions | Medications & Treatments | Nutritional Deficiencies | Stress | Age | Poor Hair Care Habits | Environmental Factors | Smoking | Weight Loss | Hair Loss |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 133992 | Yes | No | No Data | No Data | Magnesium deficiency | Moderate | 19 | Yes | Yes | No | No | 0 |
| 1 | 148393 | No | No | Eczema | Antibiotics | Magnesium deficiency | High | 43 | Yes | Yes | No | No | 0 |
| 2 | 155074 | No | No | Dermatosis | Antifungal Cream | Protein deficiency | Moderate | 26 | Yes | Yes | No | Yes | 0 |
| 3 | 118261 | Yes | Yes | Ringworm | Antibiotics | Biotin Deficiency | Moderate | 46 | Yes | Yes | No | No | 0 |
| 4 | 111915 | No | No | Psoriasis | Accutane | Iron deficiency | Moderate | 30 | No | Yes | Yes | No | 1 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 994 | 184367 | Yes | No | Seborrheic Dermatitis | Rogaine | Vitamin A Deficiency | Low | 33 | Yes | Yes | Yes | Yes | 1 |
| 995 | 164777 | Yes | Yes | No Data | Accutane | Protein deficiency | Low | 47 | No | No | No | Yes | 0 |
| 996 | 143273 | No | Yes | Androgenetic Alopecia | Antidepressants | Protein deficiency | Moderate | 20 | Yes | No | Yes | Yes | 1 |
| 997 | 169123 | No | Yes | Dermatitis | Immunomodulators | Biotin Deficiency | Moderate | 32 | Yes | Yes | Yes | Yes | 1 |
| 998 | 127183 | Yes | Yes | Psoriasis | Blood Pressure Medication | Vitamin D Deficiency | Low | 34 | No | Yes | No | No | 1 |

999 rows × 13 columns

Check data quality#

Missing values#

# Dataframe info
hairloss.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Id                         999 non-null    int64 
 1   Genetics                   999 non-null    object
 2   Hormonal Changes           999 non-null    object
 3   Medical Conditions         999 non-null    object
 4   Medications & Treatments   999 non-null    object
 5   Nutritional Deficiencies   999 non-null    object
 6   Stress                     999 non-null    object
 7   Age                        999 non-null    int64 
 8   Poor Hair Care Habits      999 non-null    object
 9   Environmental Factors      999 non-null    object
 10  Smoking                    999 non-null    object
 11  Weight Loss                999 non-null    object
 12  Hair Loss                  999 non-null    int64 
dtypes: int64(3), object(10)
memory usage: 101.6+ KB

There are no missing values, although note that several columns encode missing information as the string “No Data”.
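
As a quick check (not in the original notebook), we can count these implicit missing values per column; a minimal sketch, assuming the literal string "No Data" is used consistently:

# Count "No Data" entries per column: they act as implicit missing values
# even though pandas reports zero nulls
(hairloss == "No Data").sum()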

Duplicated values#

# Look for duplicated rows
hairloss.duplicated().sum()
0

There are no completely duplicated rows.

Let’s see if Id values (which should be unique) are duplicated:

hairloss.loc[hairloss.duplicated(subset=["Id"], keep=False), :].sort_values(by=["Id"])
| | Id | Genetics | Hormonal Changes | Medical Conditions | Medications & Treatments | Nutritional Deficiencies | Stress | Age | Poor Hair Care Habits | Environmental Factors | Smoking | Weight Loss | Hair Loss |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 388 | 110171 | Yes | No | Thyroid Problems | Antifungal Cream | Selenium deficiency | Low | 25 | Yes | No | Yes | Yes | 0 |
| 408 | 110171 | No | No | Psoriasis | Immunomodulators | Vitamin E deficiency | Low | 41 | Yes | No | No | Yes | 0 |
| 237 | 157627 | Yes | No | Dermatosis | Rogaine | Protein deficiency | Moderate | 47 | Yes | No | Yes | Yes | 1 |
| 600 | 157627 | No | No | Thyroid Problems | Accutane | Protein deficiency | Moderate | 44 | Yes | Yes | Yes | Yes | 0 |
| 118 | 172639 | Yes | Yes | Androgenetic Alopecia | Heart Medication | Iron deficiency | Moderate | 29 | Yes | No | Yes | No | 1 |
| 866 | 172639 | Yes | Yes | No Data | Accutane | Vitamin A Deficiency | Low | 47 | Yes | No | No | Yes | 0 |
| 669 | 186979 | No | No | Seborrheic Dermatitis | Blood Pressure Medication | Vitamin E deficiency | Moderate | 41 | No | No | No | No | 1 |
| 956 | 186979 | Yes | Yes | Seborrheic Dermatitis | Chemotherapy | No Data | Moderate | 21 | No | No | Yes | No | 1 |

There are duplicated Id numbers, but the rows look like records for different people. I will keep them, since Id numbers will not be used in the analysis (a hypothetical fix is sketched below).
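
If unique identifiers were needed downstream, one hypothetical option (not applied here) would be to regenerate them from the row index:

# Hypothetical fix, not applied in this analysis: derive a guaranteed-unique
# Id from the row index
hairloss_unique = hairloss.copy()
hairloss_unique["Id"] = hairloss_unique.index
assert hairloss_unique["Id"].is_unique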

Data integrity#

I noticed that some column names have trailing whitespace, which can cause problems when addressing them in the analysis, so I will remove it.

# Column names
hairloss.columns
Index(['Id', 'Genetics', 'Hormonal Changes', 'Medical Conditions',
       'Medications & Treatments', 'Nutritional Deficiencies ', 'Stress',
       'Age', 'Poor Hair Care Habits ', 'Environmental Factors', 'Smoking',
       'Weight Loss ', 'Hair Loss'],
      dtype='object')
# Remove leading and trailing whitespace from the column names
hairloss.columns = hairloss.columns.str.strip()
hairloss.columns
Index(['Id', 'Genetics', 'Hormonal Changes', 'Medical Conditions',
       'Medications & Treatments', 'Nutritional Deficiencies', 'Stress', 'Age',
       'Poor Hair Care Habits', 'Environmental Factors', 'Smoking',
       'Weight Loss', 'Hair Loss'],
      dtype='object')

Check value range: categorical variables#

Let’s see whether the variables of type ‘object’ (strings) contain a limited set of categories.

# Select column names of object type
object_cols = hairloss.select_dtypes(include="object").columns
object_cols
Index(['Genetics', 'Hormonal Changes', 'Medical Conditions',
       'Medications & Treatments', 'Nutritional Deficiencies', 'Stress',
       'Poor Hair Care Habits', 'Environmental Factors', 'Smoking',
       'Weight Loss'],
      dtype='object')
# Plot their values and counts
fig, ax = plt.subplots(len(object_cols) // 2, 2, figsize=(10, 14))
for i, col in enumerate(object_cols):
    x, y = divmod(i, 2)
    sns.despine()
    hairloss[col].value_counts().plot(ax=ax[x, y], kind="bar")

fig.tight_layout()
plt.show()
[Figure: bar charts of value counts for each object-type column]

They all contain a limited set of repeated values, i.e. categories, so I will convert them to the categorical dtype.

# Convert all object columns to the categorical dtype
conversion_dict = {
    k: "category" for k in hairloss.select_dtypes(include="object").columns
}
hairloss = hairloss.astype(conversion_dict)

# Check the data types
hairloss.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   Id                        999 non-null    int64   
 1   Genetics                  999 non-null    category
 2   Hormonal Changes          999 non-null    category
 3   Medical Conditions        999 non-null    category
 4   Medications & Treatments  999 non-null    category
 5   Nutritional Deficiencies  999 non-null    category
 6   Stress                    999 non-null    category
 7   Age                       999 non-null    int64   
 8   Poor Hair Care Habits     999 non-null    category
 9   Environmental Factors     999 non-null    category
 10  Smoking                   999 non-null    category
 11  Weight Loss               999 non-null    category
 12  Hair Loss                 999 non-null    int64   
dtypes: category(10), int64(3)
memory usage: 35.3 KB

Let’s see how balanced the target variable “Hair Loss” is:

# Plot
fig, ax = plt.subplots(figsize=(4, 2))
sns.despine()
hairloss["Hair Loss"].value_counts().plot(ax=ax, kind="bar")
plt.show()
[Figure: bar chart of “Hair Loss” value counts]

Check value range: numerical variables#

Let’s plot a histogram of the variable “Age” to get a sense of its value range.

# Plot
fig, ax = plt.subplots(figsize=(6, 4))
sns.despine()
sns.histplot(hairloss["Age"], ax=ax, discrete=True)
plt.show()
[Figure: histogram of “Age”]

I will establish age intervals for the purpose of the analysis. I choose:

  • [18, 30)

  • [30, 40)

  • [40, 51)

# Establish age intervals according to available ages
bins = pd.IntervalIndex.from_tuples([(18, 30), (30, 40), (40, 51)], closed="left")

# Create new column with age intervals as category type
hairloss["age_group"] = pd.cut(hairloss["Age"], bins)
hairloss["age_group"] = hairloss["age_group"].astype("category")
hairloss.head()
| | Id | Genetics | Hormonal Changes | Medical Conditions | Medications & Treatments | Nutritional Deficiencies | Stress | Age | Poor Hair Care Habits | Environmental Factors | Smoking | Weight Loss | Hair Loss | age_group |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 133992 | Yes | No | No Data | No Data | Magnesium deficiency | Moderate | 19 | Yes | Yes | No | No | 0 | [18, 30) |
| 1 | 148393 | No | No | Eczema | Antibiotics | Magnesium deficiency | High | 43 | Yes | Yes | No | No | 0 | [40, 51) |
| 2 | 155074 | No | No | Dermatosis | Antifungal Cream | Protein deficiency | Moderate | 26 | Yes | Yes | No | Yes | 0 | [18, 30) |
| 3 | 118261 | Yes | Yes | Ringworm | Antibiotics | Biotin Deficiency | Moderate | 46 | Yes | Yes | No | No | 0 | [40, 51) |
| 4 | 111915 | No | No | Psoriasis | Accutane | Iron deficiency | Moderate | 30 | No | Yes | Yes | No | 1 | [30, 40) |

Exploratory Analysis#

My strategy for the exploratory analysis is to divide the categorical variables into “conditions” on one side and “factors” on the other, and then visualize how the latter influence the former.

# Select category type column names
category_columns = hairloss.select_dtypes(include="category").columns

# Store names of categories with more than 3 values: these will be "conditions"
conditions = [
    column for column in category_columns if len(hairloss[column].cat.categories) > 3
]

# Store names of categories with 2 or 3 values: these will be "factors"
factors = [
    column for column in category_columns if len(hairloss[column].cat.categories) <= 3
]

print(f"conditions: {conditions}")
print(f"factors: {factors}")
conditions: ['Medical Conditions', 'Medications & Treatments', 'Nutritional Deficiencies']
factors: ['Genetics', 'Hormonal Changes', 'Stress', 'Poor Hair Care Habits', 'Environmental Factors', 'Smoking', 'Weight Loss', 'age_group']

Next, I will plot the incidence of different conditions and factors to identify any variables that significantly contribute to hair loss.

# Create a grid of plots to have a general overview
fig, ax = plt.subplots(len(factors) + 1, len(conditions), sharey=True, figsize=(12, 50))

# Iterate over conditions
for col, condition in enumerate(conditions):

    # Order categories from lowest to highest incidence and remove "No Data"
    incidence_order = (
        hairloss.groupby(condition, observed=True)["Hair Loss"]
        .mean()
        .sort_values()
        .index.remove_categories(["No Data"])
        .dropna()
    )

    # Plot in the first row overall incidence of the condition
    sns.pointplot(
        ax=ax[0, col],
        x=condition,
        y="Hair Loss",
        data=hairloss,
        order=incidence_order,
        linestyles="none",
    )
    ax[0, col].grid(axis="y")
    ax[0, col].set_axisbelow(True)
    ax[0, col].tick_params(axis="x", labelsize=10, rotation=90)
    ax[0, col].set_yticks(
        np.arange(0, 1.1, 0.1), labels=[str(n) + " %" for n in list(range(0, 110, 10))]
    )
    sns.despine()

    # For each condition iterate over factors and plot
    for line, factor in enumerate(factors):
        sns.pointplot(
            ax=ax[line + 1, col],
            x=condition,
            y="Hair Loss",
            data=hairloss,
            hue=factor,
            order=incidence_order,
            linestyles="none",
            dodge=0.2,
        )
        ax[line + 1, col].grid(axis="y")
        ax[line + 1, col].set_axisbelow(True)
        ax[line + 1, col].tick_params(axis="x", labelsize=10, rotation=90)
        ax[line + 1, col].set_yticks(
            np.arange(0, 1.1, 0.1),
            labels=[str(n) + " %" for n in list(range(0, 110, 10))],
        )
        sns.despine()

fig.tight_layout()
plt.show()
[Figure: grid of point plots of hair-loss incidence per condition, overall (top row) and broken down by each factor]

Overall, the confidence intervals overlap in most cases, so from a statistical point of view we cannot establish clear patterns in the population.

Here I highlight some exceptions where the intervals do not overlap, so the difference is significant and may indicate a pattern (a formal check of one of them is sketched after the list):

  • If an individual has the condition Ringworm, then age has an influence: the younger the individual, the higher the incidence of baldness.

  • If the person is undergoing Rogaine treatment, then hormonal changes relate more clearly to baldness.

  • If undergoing chemotherapy treatment, weight loss is clearly more linked to hair loss.

  • Steroids cause more hair loss in younger individuals than in older ones.

  • Among individuals with iron deficiency, non-smokers show a clearly higher incidence of baldness than smokers.

  • Among individuals with vitamin E deficiency, middle-aged people are bald more often than young people.
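
As an example of how one of these differences could be verified more formally, here is a minimal sketch (not part of the original analysis, and assuming scipy is available) that applies a chi-squared test of independence to the chemotherapy subset:

from scipy.stats import chi2_contingency

# Among people undergoing chemotherapy, test whether hair loss is
# independent of significant weight loss
chemo = hairloss[hairloss["Medications & Treatments"] == "Chemotherapy"]
table = pd.crosstab(chemo["Weight Loss"].astype(str), chemo["Hair Loss"])
chi2, p_value, dof, expected = chi2_contingency(table)
print(table)
print(f"chi-squared = {chi2:.2f}, p-value = {p_value:.3f}")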

Machine Learning#

I will test out some models to check whether they can learn from the data to predict the hair-loss binary outcome.

Four types of classifiers will be tested:

  • Logistic Regression

  • K-Nearest Neighbours

  • Decision Tree Classifier

  • Random Forest Classifier

# Define target
target = hairloss["Hair Loss"]

# Define features
features = hairloss.drop(
    [
        "Hair Loss",  # Target
        "Id",  # Not a feature
        "Age",  # "age-group" will be considered instead
    ],
    axis=1,
)

# Identify binary and non-binary features according to number of categories
binary_features = [
    feature
    for feature in features.columns
    if len(features[feature].cat.categories) == 2
]
non_binary_features = [
    feature for feature in features.columns if len(features[feature].cat.categories) > 2
]

# Redefine features creating dummies and avoiding collinearity of binary features
features = pd.concat(
    [
        pd.get_dummies(features[binary_features], drop_first=True, dtype="int"),
        pd.get_dummies(features[non_binary_features], dtype="int"),
    ],
    axis=1,
)

# Drop "No Data" features as they do not provide meaningful info
no_data_features = [feature for feature in features.columns if "_No Data" in feature]
features = features.drop(no_data_features, axis=1)

# Assign target
y = target

# Assign features
X = features

# Split dataset into training and test set, and stratify
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scale features
scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X.columns)

# Define classifier models that will be tested
models = [LogisticRegression(), KNeighborsClassifier(), DecisionTreeClassifier(), RandomForestClassifier()]

# Define ranges for parameters for each model
param_grids = [{"C": [0.01, 0.1, 1, 10, 100]},
               {"n_neighbors": range(1, 12)},
               {"max_depth": range(1, 20)},
               {"n_estimators": range(1, 50)}]

# Search for the best parameter for each model
for model, param_grid in zip(models, param_grids):
    kf = KFold(n_splits=6, shuffle=True, random_state=42)
    mdl = model
    mdl_cv = GridSearchCV(mdl, param_grid, cv=kf)
    mdl_cv.fit(X_train, y_train)
    print(f"\nResults for {model}:")
    print(f"\tBest parameter -> {mdl_cv.best_params_}")
    print(f"\tBest score in CV -> {mdl_cv.best_score_:.2f}")
    print(f"\tScore in test-set -> {mdl_cv.score(X_test, y_test):.2f}")
    print(f"\tScore in training-set -> {mdl_cv.score(X_train, y_train):.2f}\n")


# Calculate train and test accuracy across each parameter range to plot
def accuracy_over_param(model_class, param_name, param_values):
    """Fit model_class once per parameter value; return train/test accuracies."""
    train_scores, test_scores = [], []
    for value in param_values:
        # Instantiate and fit the model on the training set
        model = model_class(**{param_name: value})
        model.fit(X_train, y_train)
        # Store results
        train_scores.append(accuracy_score(y_train, model.predict(X_train)))
        test_scores.append(accuracy_score(y_test, model.predict(X_test)))
    return pd.DataFrame({"train": train_scores, "test": test_scores})


df_lr = accuracy_over_param(LogisticRegression, "C", param_grids[0]["C"])
df_knn = accuracy_over_param(KNeighborsClassifier, "n_neighbors", param_grids[1]["n_neighbors"])
df_dt = accuracy_over_param(DecisionTreeClassifier, "max_depth", param_grids[2]["max_depth"])
df_rf = accuracy_over_param(RandomForestClassifier, "n_estimators", param_grids[3]["n_estimators"])

# Plot
fig, ax = plt.subplots(2, 2, figsize=(8, 6))
# lr
df_lr.plot(ax=ax[0, 0])
ax[0, 0].set_title("Logistic Regression", fontsize=10, fontweight="bold")
ax[0, 0].set_xlabel("C- Inverse of Regularization Strength", fontsize=10)
ax[0, 0].set_xticks(range(5), labels=["0.01", "0.1", "1", "10", "100"])
# knn
df_knn.plot(ax=ax[0, 1])
ax[0, 1].set_title("K-Nearest Neighbours", fontsize=10, fontweight="bold")
ax[0, 1].set_xlabel("n_neighbors", fontsize=10)
# dt
df_dt.plot(ax=ax[1, 0])
ax[1, 0].set_title("Decision Tree Classifier", fontsize=10, fontweight="bold")
ax[1, 0].set_xlabel("max_depth", fontsize=10)
# rf
df_rf.plot(ax=ax[1, 1])
ax[1, 1].set_title("Random Forest Classifier", fontsize=10, fontweight="bold")
ax[1, 1].set_xlabel("n_estimators", fontsize=10)
# all
for i in range(2):
    for j in range(2):
        ax[i, j].set_ylabel("Accuracy score", fontsize=10)
        ax[i, j].set_yticks(np.arange(0.4, 1.1, 0.1))
sns.despine()
fig.tight_layout()
plt.show()
Results for LogisticRegression():
	Best parameter -> {'C': 0.01}
	Best score in CV -> 0.49
	Score in test-set -> 0.50
	Score in training-set -> 0.59


Results for KNeighborsClassifier():
	Best parameter -> {'n_neighbors': 4}
	Best score in CV -> 0.50
	Score in test-set -> 0.51
	Score in training-set -> 0.66


Results for DecisionTreeClassifier():
	Best parameter -> {'max_depth': 7}
	Best score in CV -> 0.48
	Score in test-set -> 0.54
	Score in training-set -> 0.73


Results for RandomForestClassifier():
	Best parameter -> {'n_estimators': 33}
	Best score in CV -> 0.53
	Score in test-set -> 0.49
	Score in training-set -> 1.00
[Figure: train vs. test accuracy across the parameter range for each of the four classifiers]

No matter the model and the parameters assigned, the test-set score is always around 0.5, which suggests that these models predict barely better than random guessing.

Beyond model complexity, which can cause overfitting on the training data, the models seem in general to be underfitting: they fail to capture patterns that generalize beyond the training set. Perhaps, as the overlapping confidence intervals in the exploratory analysis already suggested, the features simply do not provide clear, useful information for predicting the outcome outside a few very specific cases.
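
To put those scores in context, a majority-class baseline is a useful reference; a minimal sketch using scikit-learn’s DummyClassifier (not part of the original analysis):

from sklearn.dummy import DummyClassifier

# A classifier that always predicts the most frequent class of the training
# set; models whose test accuracy sits near this value have learned little
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
print(f"Majority-class baseline accuracy: {baseline.score(X_test, y_test):.2f}")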

If we reduce the features to a two-dimensional space with t-SNE, we see that the two classes of the target do not separate.

# Prepare dataframe
df_tsne = pd.concat([X, y], axis=1)

# TSNE
tsne = TSNE(learning_rate=50, random_state=42)
tsne_features = tsne.fit_transform(df_tsne.drop("Hair Loss", axis=1))

# Assign values
df_tsne["x"] = tsne_features[:, 0]
df_tsne["y"] = tsne_features[:, 1]

# Plot
fig, ax = plt.subplots(figsize=(4, 4))
sns.scatterplot(x="x", y="y", hue="Hair Loss", data=df_tsne, ax=ax)
ax.set_title("t-SNE", fontsize=12)
sns.despine()
plt.show()
[Figure: t-SNE projection of the features, colored by “Hair Loss”]

The features do not seem to contain enough separable structure to distinguish the class of interest.
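
As a complementary check (not part of the original analysis), a linear projection with PCA tells the same story; a minimal sketch reusing X and y from above:

from sklearn.decomposition import PCA

# Linear 2-D projection for comparison with the non-linear t-SNE view
pca = PCA(n_components=2, random_state=42)
pca_xy = pca.fit_transform(X)

fig, ax = plt.subplots(figsize=(4, 4))
sns.scatterplot(x=pca_xy[:, 0], y=pca_xy[:, 1], hue=y, ax=ax)
ax.set_title("PCA", fontsize=12)
sns.despine()
plt.show()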

Conclusions#

  • Since none of the models considered generalizes well to the test data, I conclude that the data does not contain enough information to support pattern-based predictions.

  • Despite the scarcity of records (around one thousand) relative to the number of features (around forty after one-hot encoding), the inability to generalize does not seem to be due to overfitting (memorizing the training data) but rather to underfitting: an absence of patterns. On unseen data no model has any predictive power, no matter how simple or complex it is made by configuring its parameters.

  • Therefore, it also makes no sense to analyze which features most influence the outcome, since the results are not based on a reproducible predictive rule.