Hair Loss Analysis#

A DataCamp challenge · November 2024

Data Analysis · Machine Learning

The project#

As we age, hair loss becomes a health concern for many people. Fullness of hair not only affects appearance but is also closely related to an individual’s health.

A survey brings together a variety of factors that may contribute to hair loss, including genetic factors, hormonal changes, medical conditions, medications, nutritional deficiencies, psychological stress, and more. Through data exploration and analysis, the potential correlation between these factors and hair loss can be explored in depth, providing a useful reference for individual health management, medical intervention, and related industries.

The data#

The data contains information on the people surveyed. Each row represents one person.

  • Id - A unique identifier for each person.

  • Genetics - Whether the person has a family history of baldness.

  • Hormonal Changes - Indicates whether the individual has experienced hormonal changes (Yes/No).

  • Medical Conditions - Medical history that may lead to baldness: alopecia areata, thyroid problems, scalp infections, psoriasis, dermatitis, etc.

  • Medications & Treatments - History of medications that may cause hair loss: chemotherapy, heart medications, antidepressants, steroids, etc.

  • Nutritional Deficiencies - Lists nutritional deficiencies that may contribute to hair loss, such as iron deficiency, vitamin D deficiency, biotin deficiency, omega-3 fatty acid deficiency, etc.

  • Stress - Indicates the stress level of the individual (Low/Moderate/High).

  • Age - Represents the age of the individual.

  • Poor Hair Care Habits - Indicates whether the individual practices poor hair care habits (Yes/No).

  • Environmental Factors - Indicates whether the individual is exposed to environmental factors that may contribute to hair loss (Yes/No).

  • Smoking - Indicates whether the individual smokes (Yes/No).

  • Weight Loss - Indicates whether the individual has experienced significant weight loss (Yes/No).

  • Hair Loss - Binary variable indicating the presence (1) or absence (0) of baldness in the individual.

Data validation#

Read the data#

# Import packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import (
    GridSearchCV,
    KFold,
    cross_val_score,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Read the data from file
hairloss = pd.read_csv("data/Predict Hair Fall.csv")
hairloss
| | Id | Genetics | Hormonal Changes | Medical Conditions | Medications & Treatments | Nutritional Deficiencies | Stress | Age | Poor Hair Care Habits | Environmental Factors | Smoking | Weight Loss | Hair Loss |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 133992 | Yes | No | No Data | No Data | Magnesium deficiency | Moderate | 19 | Yes | Yes | No | No | 0 |
| 1 | 148393 | No | No | Eczema | Antibiotics | Magnesium deficiency | High | 43 | Yes | Yes | No | No | 0 |
| 2 | 155074 | No | No | Dermatosis | Antifungal Cream | Protein deficiency | Moderate | 26 | Yes | Yes | No | Yes | 0 |
| 3 | 118261 | Yes | Yes | Ringworm | Antibiotics | Biotin Deficiency | Moderate | 46 | Yes | Yes | No | No | 0 |
| 4 | 111915 | No | No | Psoriasis | Accutane | Iron deficiency | Moderate | 30 | No | Yes | Yes | No | 1 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 994 | 184367 | Yes | No | Seborrheic Dermatitis | Rogaine | Vitamin A Deficiency | Low | 33 | Yes | Yes | Yes | Yes | 1 |
| 995 | 164777 | Yes | Yes | No Data | Accutane | Protein deficiency | Low | 47 | No | No | No | Yes | 0 |
| 996 | 143273 | No | Yes | Androgenetic Alopecia | Antidepressants | Protein deficiency | Moderate | 20 | Yes | No | Yes | Yes | 1 |
| 997 | 169123 | No | Yes | Dermatitis | Immunomodulators | Biotin Deficiency | Moderate | 32 | Yes | Yes | Yes | Yes | 1 |
| 998 | 127183 | Yes | Yes | Psoriasis | Blood Pressure Medication | Vitamin D Deficiency | Low | 34 | No | Yes | No | No | 1 |

999 rows × 13 columns

Check data quality#

Missing values#

# Dataframe info
hairloss.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Id                         999 non-null    int64 
 1   Genetics                   999 non-null    object
 2   Hormonal Changes           999 non-null    object
 3   Medical Conditions         999 non-null    object
 4   Medications & Treatments   999 non-null    object
 5   Nutritional Deficiencies   999 non-null    object
 6   Stress                     999 non-null    object
 7   Age                        999 non-null    int64 
 8   Poor Hair Care Habits      999 non-null    object
 9   Environmental Factors      999 non-null    object
 10  Smoking                    999 non-null    object
 11  Weight Loss                999 non-null    object
 12  Hair Loss                  999 non-null    int64 
dtypes: int64(3), object(10)
memory usage: 101.6+ KB

There are no missing values, although note that several columns encode missing information as the string “No Data”.
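
As a quick check (not in the original notebook), we can count these implicit missing values per column; a minimal sketch, assuming the literal string "No Data" is used consistently:

# Count "No Data" entries per column: they act as implicit missing values
# even though pandas reports zero nulls
(hairloss == "No Data").sum()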

Duplicated values#

# Look for duplicated rows
hairloss.duplicated().sum()
0

There are no completely duplicated rows.

Let’s see if Id values (which should be unique) are duplicated:

hairloss.loc[hairloss.duplicated(subset=["Id"], keep=False), :].sort_values(by=["Id"])
| | Id | Genetics | Hormonal Changes | Medical Conditions | Medications & Treatments | Nutritional Deficiencies | Stress | Age | Poor Hair Care Habits | Environmental Factors | Smoking | Weight Loss | Hair Loss |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 388 | 110171 | Yes | No | Thyroid Problems | Antifungal Cream | Selenium deficiency | Low | 25 | Yes | No | Yes | Yes | 0 |
| 408 | 110171 | No | No | Psoriasis | Immunomodulators | Vitamin E deficiency | Low | 41 | Yes | No | No | Yes | 0 |
| 237 | 157627 | Yes | No | Dermatosis | Rogaine | Protein deficiency | Moderate | 47 | Yes | No | Yes | Yes | 1 |
| 600 | 157627 | No | No | Thyroid Problems | Accutane | Protein deficiency | Moderate | 44 | Yes | Yes | Yes | Yes | 0 |
| 118 | 172639 | Yes | Yes | Androgenetic Alopecia | Heart Medication | Iron deficiency | Moderate | 29 | Yes | No | Yes | No | 1 |
| 866 | 172639 | Yes | Yes | No Data | Accutane | Vitamin A Deficiency | Low | 47 | Yes | No | No | Yes | 0 |
| 669 | 186979 | No | No | Seborrheic Dermatitis | Blood Pressure Medication | Vitamin E deficiency | Moderate | 41 | No | No | No | No | 1 |
| 956 | 186979 | Yes | Yes | Seborrheic Dermatitis | Chemotherapy | No Data | Moderate | 21 | No | No | Yes | No | 1 |

There are duplicated Id numbers, but the rows look like records for different people. I will keep them, since Id numbers will not be used in the analysis (a hypothetical fix is sketched below).
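
If unique identifiers were needed downstream, one hypothetical option (not applied here) would be to regenerate them from the row index:

# Hypothetical fix, not applied in this analysis: derive a guaranteed-unique
# Id from the row index
hairloss_unique = hairloss.copy()
hairloss_unique["Id"] = hairloss_unique.index
assert hairloss_unique["Id"].is_unique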

Data integrity#

I noticed that some column names have trailing whitespace, which can cause problems when addressing them in the analysis, so I will remove it.

# Column names
hairloss.columns
Index(['Id', 'Genetics', 'Hormonal Changes', 'Medical Conditions',
       'Medications & Treatments', 'Nutritional Deficiencies ', 'Stress',
       'Age', 'Poor Hair Care Habits ', 'Environmental Factors', 'Smoking',
       'Weight Loss ', 'Hair Loss'],
      dtype='object')
# Remove leading and trailing whitespace from the column names
hairloss.columns = hairloss.columns.str.strip()
hairloss.columns
Index(['Id', 'Genetics', 'Hormonal Changes', 'Medical Conditions',
       'Medications & Treatments', 'Nutritional Deficiencies', 'Stress', 'Age',
       'Poor Hair Care Habits', 'Environmental Factors', 'Smoking',
       'Weight Loss', 'Hair Loss'],
      dtype='object')

Check value range: categorical variables#

Let’s see whether the variables of type ‘object’ (strings) contain a limited set of categories.

# Select column names of object type
object_cols = hairloss.select_dtypes(include="object").columns
object_cols
Index(['Genetics', 'Hormonal Changes', 'Medical Conditions',
       'Medications & Treatments', 'Nutritional Deficiencies', 'Stress',
       'Poor Hair Care Habits', 'Environmental Factors', 'Smoking',
       'Weight Loss'],
      dtype='object')
# Plot their values and counts
fig, ax = plt.subplots(len(object_cols) // 2, 2, figsize=(10, 14))
for i, col in enumerate(object_cols):
    x, y = divmod(i, 2)
    sns.despine()
    hairloss[col].value_counts().plot(ax=ax[x, y], kind="bar")

fig.tight_layout()
plt.show()
[Figure: bar charts of value counts for each object-type column]

They all contain a limited set of repeated values, i.e. categories, so I will convert them to the categorical dtype.

# Convert all object columns to the categorical dtype
conversion_dict = {
    k: "category" for k in hairloss.select_dtypes(include="object").columns
}
hairloss = hairloss.astype(conversion_dict)

# Check the data types
hairloss.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   Id                        999 non-null    int64   
 1   Genetics                  999 non-null    category
 2   Hormonal Changes          999 non-null    category
 3   Medical Conditions        999 non-null    category
 4   Medications & Treatments  999 non-null    category
 5   Nutritional Deficiencies  999 non-null    category
 6   Stress                    999 non-null    category
 7   Age                       999 non-null    int64   
 8   Poor Hair Care Habits     999 non-null    category
 9   Environmental Factors     999 non-null    category
 10  Smoking                   999 non-null    category
 11  Weight Loss               999 non-null    category
 12  Hair Loss                 999 non-null    int64   
dtypes: category(10), int64(3)
memory usage: 35.3 KB

Let’s see how balanced the target variable “Hair Loss” is:

# Plot
fig, ax = plt.subplots(figsize=(4, 2))
sns.despine()
hairloss["Hair Loss"].value_counts().plot(ax=ax, kind="bar")
plt.show()
[Figure: bar chart of “Hair Loss” value counts]

Check value range: numerical variables#

Let’s plot a histogram of the variable “Age” to get a sense of its value range.

# Plot
fig, ax = plt.subplots(figsize=(6, 4))
sns.despine()
sns.histplot(hairloss["Age"], ax=ax, discrete=True)
plt.show()
[Figure: histogram of “Age”]

I will establish age intervals for the purpose of the analysis. I choose:

  • [18, 30)

  • [30, 40)

  • [40, 51)

# Establish age intervals according to available ages
bins = pd.IntervalIndex.from_tuples([(18, 30), (30, 40), (40, 51)], closed="left")

# Create new column with age intervals as category type
hairloss["age_group"] = pd.cut(hairloss["Age"], bins)
hairloss["age_group"] = hairloss["age_group"].astype("category")
hairloss.head()
| | Id | Genetics | Hormonal Changes | Medical Conditions | Medications & Treatments | Nutritional Deficiencies | Stress | Age | Poor Hair Care Habits | Environmental Factors | Smoking | Weight Loss | Hair Loss | age_group |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 133992 | Yes | No | No Data | No Data | Magnesium deficiency | Moderate | 19 | Yes | Yes | No | No | 0 | [18, 30) |
| 1 | 148393 | No | No | Eczema | Antibiotics | Magnesium deficiency | High | 43 | Yes | Yes | No | No | 0 | [40, 51) |
| 2 | 155074 | No | No | Dermatosis | Antifungal Cream | Protein deficiency | Moderate | 26 | Yes | Yes | No | Yes | 0 | [18, 30) |
| 3 | 118261 | Yes | Yes | Ringworm | Antibiotics | Biotin Deficiency | Moderate | 46 | Yes | Yes | No | No | 0 | [40, 51) |
| 4 | 111915 | No | No | Psoriasis | Accutane | Iron deficiency | Moderate | 30 | No | Yes | Yes | No | 1 | [30, 40) |

Exploratory Analysis#

My strategy for the exploratory analysis is to divide the categorical variables into “conditions” on one side and “factors” on the other, and then visualize how the latter influence the former.

# Select category type column names
category_columns = hairloss.select_dtypes(include="category").columns

# Store names of categories with more than 3 values: these will be "conditions"
conditions = [
    column for column in category_columns if len(hairloss[column].cat.categories) > 3
]

# Store names of categories with 2 or 3 values: these will be "factors"
factors = [
    column for column in category_columns if len(hairloss[column].cat.categories) <= 3
]

print(f"conditions: {conditions}")
print(f"factors: {factors}")
conditions: ['Medical Conditions', 'Medications & Treatments', 'Nutritional Deficiencies']
factors: ['Genetics', 'Hormonal Changes', 'Stress', 'Poor Hair Care Habits', 'Environmental Factors', 'Smoking', 'Weight Loss', 'age_group']

Next, I will plot the incidence of different conditions and factors to identify any variables that significantly contribute to hair loss.

# Create a grid of plots to have a general overview
fig, ax = plt.subplots(len(factors) + 1, len(conditions), sharey=True, figsize=(12, 50))

# Iterate over conditions
for col, condition in enumerate(conditions):

    # Order categories from lowest to highest incidence and remove "No Data"
    incidence_order = (
        hairloss.groupby(condition, observed=True)["Hair Loss"]
        .mean()
        .sort_values()
        .index.remove_categories(["No Data"])
        .dropna()
    )

    # Plot in the first row overall incidence of the condition
    sns.pointplot(
        ax=ax[0, col],
        x=condition,
        y="Hair Loss",
        data=hairloss,
        order=incidence_order,
        linestyles="none",
    )
    ax[0, col].grid(axis="y")
    ax[0, col].set_axisbelow(True)
    ax[0, col].tick_params(axis="x", labelsize=10, rotation=90)
    ax[0, col].set_yticks(
        np.arange(0, 1.1, 0.1), labels=[str(n) + " %" for n in list(range(0, 110, 10))]
    )
    sns.despine()

    # For each condition iterate over factors and plot
    for line, factor in enumerate(factors):
        sns.pointplot(
            ax=ax[line + 1, col],
            x=condition,
            y="Hair Loss",
            data=hairloss,
            hue=factor,
            order=incidence_order,
            linestyles="none",
            dodge=0.2,
        )
        ax[line + 1, col].grid(axis="y")
        ax[line + 1, col].set_axisbelow(True)
        ax[line + 1, col].tick_params(axis="x", labelsize=10, rotation=90)
        ax[line + 1, col].set_yticks(
            np.arange(0, 1.1, 0.1),
            labels=[str(n) + " %" for n in list(range(0, 110, 10))],
        )
        sns.despine()

fig.tight_layout()
plt.show()
[Figure: grid of point plots of hair-loss incidence per condition, overall (top row) and broken down by each factor]

Overall, the confidence intervals overlap in most cases, so from a statistical point of view we cannot establish clear patterns in the population.

Here I highlight some exceptions where the intervals do not overlap, so the difference is significant and may indicate a pattern (a formal check of one of them is sketched after the list):

  • If an individual has the condition Ringworm, then age has an influence: the younger the individual, the higher the incidence of baldness.

  • If the person is undergoing Rogaine treatment, then hormonal changes relate more clearly to baldness.

  • If undergoing chemotherapy treatment, weight loss is clearly more linked to hair loss.

  • Steroids cause more hair loss in younger individuals than in older ones.

  • Among individuals with iron deficiency, non-smokers show a clearly higher incidence of baldness than smokers.

  • Among individuals with vitamin E deficiency, middle-aged people are bald more often than young people.
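
As an example of how one of these differences could be verified more formally, here is a minimal sketch (not part of the original analysis, and assuming scipy is available) that applies a chi-squared test of independence to the chemotherapy subset:

from scipy.stats import chi2_contingency

# Among people undergoing chemotherapy, test whether hair loss is
# independent of significant weight loss
chemo = hairloss[hairloss["Medications & Treatments"] == "Chemotherapy"]
table = pd.crosstab(chemo["Weight Loss"].astype(str), chemo["Hair Loss"])
chi2, p_value, dof, expected = chi2_contingency(table)
print(table)
print(f"chi-squared = {chi2:.2f}, p-value = {p_value:.3f}")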

Machine Learning#

I will test out some models to check whether they can learn from the data to predict the hair-loss binary outcome.

Four types of classifiers will be tested:

  • Logistic Regression

  • K-Nearest Neighbours

  • Decision Tree Classifier

  • Random Forest Classifier

# Define target
target = hairloss["Hair Loss"]

# Define features
features = hairloss.drop(
    [
        "Hair Loss",  # Target
        "Id",  # Not a feature
        "Age",  # "age-group" will be considered instead
    ],
    axis=1,
)

# Identify binary and non-binary features according to number of categories
binary_features = [
    feature
    for feature in features.columns
    if len(features[feature].cat.categories) == 2
]
non_binary_features = [
    feature for feature in features.columns if len(features[feature].cat.categories) > 2
]

# Redefine features creating dummies and avoiding collinearity of binary features
features = pd.concat(
    [
        pd.get_dummies(features[binary_features], drop_first=True, dtype="int"),
        pd.get_dummies(features[non_binary_features], dtype="int"),
    ],
    axis=1,
)

# Drop "No Data" features as they do not provide meaningful info
no_data_features = [feature for feature in features.columns if "_No Data" in feature]
features = features.drop(no_data_features, axis=1)

# Assign target
y = target

# Assign features
X = features

# Split dataset into training and test set, and stratify
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scale features
scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X.columns)

# Define classifier models that will be tested
models = [LogisticRegression(), KNeighborsClassifier(), DecisionTreeClassifier(), RandomForestClassifier()]

# Define ranges for parameters for each model
param_grids = [{"C": [0.01, 0.1, 1, 10, 100]},
               {"n_neighbors": range(1, 12)},
               {"max_depth": range(1, 20)},
               {"n_estimators": range(1, 50)}]

# Search for the best parameter for each model
for model, param_grid in zip(models, param_grids):
    kf = KFold(n_splits=6, shuffle=True, random_state=42)
    mdl = model
    mdl_cv = GridSearchCV(mdl, param_grid, cv=kf)
    mdl_cv.fit(X_train, y_train)
    print(f"\nResults for {model}:")
    print(f"\tBest parameter -> {mdl_cv.best_params_}")
    print(f"\tBest score in CV -> {mdl_cv.best_score_:.2f}")
    print(f"\tScore in test-set -> {mdl_cv.score(X_test, y_test):.2f}")
    print(f"\tScore in training-set -> {mdl_cv.score(X_train, y_train):.2f}\n")


# Calculate train and test accuracy across each parameter range to plot
def accuracy_over_param(model_class, param_name, param_values):
    """Fit model_class once per parameter value; return train/test accuracies."""
    train_scores, test_scores = [], []
    for value in param_values:
        # Instantiate and fit the model on the training set
        model = model_class(**{param_name: value})
        model.fit(X_train, y_train)
        # Store results
        train_scores.append(accuracy_score(y_train, model.predict(X_train)))
        test_scores.append(accuracy_score(y_test, model.predict(X_test)))
    return pd.DataFrame({"train": train_scores, "test": test_scores})


df_lr = accuracy_over_param(LogisticRegression, "C", param_grids[0]["C"])
df_knn = accuracy_over_param(KNeighborsClassifier, "n_neighbors", param_grids[1]["n_neighbors"])
df_dt = accuracy_over_param(DecisionTreeClassifier, "max_depth", param_grids[2]["max_depth"])
df_rf = accuracy_over_param(RandomForestClassifier, "n_estimators", param_grids[3]["n_estimators"])

# Plot
fig, ax = plt.subplots(2, 2, figsize=(8, 6))
# lr
df_lr.plot(ax=ax[0, 0])
ax[0, 0].set_title("Logistic Regression", fontsize=10, fontweight="bold")
ax[0, 0].set_xlabel("C- Inverse of Regularization Strength", fontsize=10)
ax[0, 0].set_xticks(range(5), labels=["0.01", "0.1", "1", "10", "100"])
# knn
df_knn.plot(ax=ax[0, 1])
ax[0, 1].set_title("K-Nearest Neighbours", fontsize=10, fontweight="bold")
ax[0, 1].set_xlabel("n_neighbors", fontsize=10)
# dt
df_dt.plot(ax=ax[1, 0])
ax[1, 0].set_title("Decision Tree Classifier", fontsize=10, fontweight="bold")
ax[1, 0].set_xlabel("max_depth", fontsize=10)
# rf
df_rf.plot(ax=ax[1, 1])
ax[1, 1].set_title("Random Forest Classifier", fontsize=10, fontweight="bold")
ax[1, 1].set_xlabel("n_estimators", fontsize=10)
# all
for i in range(2):
    for j in range(2):
        ax[i, j].set_ylabel("Accuracy score", fontsize=10)
        ax[i, j].set_yticks(np.arange(0.4, 1.1, 0.1))
sns.despine()
fig.tight_layout()
plt.show()
Results for LogisticRegression():
	Best parameter -> {'C': 0.01}
	Best score in CV -> 0.49
	Score in test-set -> 0.50
	Score in training-set -> 0.59


Results for KNeighborsClassifier():
	Best parameter -> {'n_neighbors': 4}
	Best score in CV -> 0.50
	Score in test-set -> 0.51
	Score in training-set -> 0.66


Results for DecisionTreeClassifier():
	Best parameter -> {'max_depth': 7}
	Best score in CV -> 0.48
	Score in test-set -> 0.54
	Score in training-set -> 0.73


Results for RandomForestClassifier():
	Best parameter -> {'n_estimators': 33}
	Best score in CV -> 0.53
	Score in test-set -> 0.49
	Score in training-set -> 1.00
[Figure: train vs. test accuracy across the parameter range for each of the four classifiers]

No matter the model and the parameters assigned, the test-set score is always around 0.5, which suggests that these models predict barely better than random guessing.

Beyond model complexity, which can cause overfitting on the training data, the models seem in general to be underfitting: they fail to capture patterns that generalize beyond the training set. Perhaps, as the overlapping confidence intervals in the exploratory analysis already suggested, the features simply do not provide clear, useful information for predicting the outcome outside a few very specific cases.
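
To put those scores in context, a majority-class baseline is a useful reference; a minimal sketch using scikit-learn’s DummyClassifier (not part of the original analysis):

from sklearn.dummy import DummyClassifier

# A classifier that always predicts the most frequent class of the training
# set; models whose test accuracy sits near this value have learned little
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
print(f"Majority-class baseline accuracy: {baseline.score(X_test, y_test):.2f}")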

If we reduce the features to a two-dimensional space with t-SNE, we see that the two classes of the target do not separate.

# Prepare dataframe
df_tsne = pd.concat([X, y], axis=1)

# TSNE
tsne = TSNE(learning_rate=50, random_state=42)
tsne_features = tsne.fit_transform(df_tsne.drop("Hair Loss", axis=1))

# Assign values
df_tsne["x"] = tsne_features[:, 0]
df_tsne["y"] = tsne_features[:, 1]

# Plot
fig, ax = plt.subplots(figsize=(4, 4))
sns.scatterplot(x="x", y="y", hue="Hair Loss", data=df_tsne, ax=ax)
ax.set_title("t-SNE", fontsize=12)
sns.despine()
plt.show()
[Figure: t-SNE projection of the features, colored by “Hair Loss”]

The features do not seem to contain enough separable structure to distinguish the class of interest.
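
As a complementary check (not part of the original analysis), a linear projection with PCA tells the same story; a minimal sketch reusing X and y from above:

from sklearn.decomposition import PCA

# Linear 2-D projection for comparison with the non-linear t-SNE view
pca = PCA(n_components=2, random_state=42)
pca_xy = pca.fit_transform(X)

fig, ax = plt.subplots(figsize=(4, 4))
sns.scatterplot(x=pca_xy[:, 0], y=pca_xy[:, 1], hue=y, ax=ax)
ax.set_title("PCA", fontsize=12)
sns.despine()
plt.show()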

Conclusions#

  • Since none of the models considered generalizes well to the test data, I conclude that the data does not contain enough information to support pattern-based predictions.

  • Despite the scarcity of records (around one thousand) relative to the number of features (around forty after one-hot encoding), the inability to generalize does not seem to be due to overfitting (memorizing the training data) but rather to underfitting: an absence of patterns. On unseen data no model has any predictive power, no matter how simple or complex it is made by configuring its parameters.

  • Therefore, it also makes no sense to analyze which features most influence the outcome, since the results are not based on a reproducible predictive rule.