Helping find a better sleep

A DataCamp challenge Jan, 2025

Data Analysis

The project#

A sleep-tracking app monitors sleep patterns and collects users’ self-reported data on lifestyle habits. The idea is to identify lifestyle, health, and demographic factors that strongly correlate with poor sleep quality.

The data#

An anonymized dataset is provided of sleep and lifestyle metrics for 374 individuals. This dataset contains average values for each person calculated over the past six months.

The dataset includes 13 columns covering sleep duration, quality, disorders, exercise, stress, diet, demographics, and other factors related to sleep health.

Column	Description
`Person ID`	An identifier for each individual.
`Gender`	The gender of the person (Male/Female).
`Age`	The age of the person in years.
`Occupation`	The occupation or profession of the person.
`Sleep Duration (hours)`	The average number of hours the person sleeps per day.
`Quality of Sleep (scale: 1-10)`	A subjective rating of the quality of sleep, ranging from 1 to 10.
`Physical Activity Level (minutes/day)`	The average number of minutes the person engages in physical activity daily.
`Stress Level (scale: 1-10)`	A subjective rating of the stress level experienced by the person, ranging from 1 to 10.
`BMI Category`	The BMI category of the person (e.g., Underweight, Normal, Overweight).
`Blood Pressure (systolic/diastolic)`	The average blood pressure measurement of the person, indicated as systolic pressure over diastolic pressure.
`Heart Rate (bpm)`	The average resting heart rate of the person in beats per minute.
`Daily Steps`	The average number of steps the person takes per day.
`Sleep Disorder`	The presence or absence of a sleep disorder in the person (None, Insomnia, Sleep Apnea).

Acknowledgments: Laksika Tharmalingam, Kaggle: https://www.kaggle.com/datasets/uom190346a/sleep-health-and-lifestyle-dataset (this is a fictitious dataset)

Data validation#

Read the data#

	Person ID	Gender	Age	Occupation	Sleep Duration	Quality of Sleep	Physical Activity Level	Stress Level	BMI Category	Blood Pressure	Heart Rate	Daily Steps	Sleep Disorder
0	1	Male	27	Software Engineer	6.1	6	42	6	Overweight	126/83	77	4200	NaN
1	2	Male	28	Doctor	6.2	6	60	8	Normal	125/80	75	10000	NaN
2	3	Male	28	Doctor	6.2	6	60	8	Normal	125/80	75	10000	NaN
3	4	Male	28	Sales Representative	5.9	4	30	8	Obese	140/90	85	3000	Sleep Apnea
4	5	Male	28	Sales Representative	5.9	4	30	8	Obese	140/90	85	3000	Sleep Apnea
...	...	...	...	...	...	...	...	...	...	...	...	...	...
369	370	Female	59	Nurse	8.1	9	75	3	Overweight	140/95	68	7000	Sleep Apnea
370	371	Female	59	Nurse	8.0	9	75	3	Overweight	140/95	68	7000	Sleep Apnea
371	372	Female	59	Nurse	8.1	9	75	3	Overweight	140/95	68	7000	Sleep Apnea
372	373	Female	59	Nurse	8.1	9	75	3	Overweight	140/95	68	7000	Sleep Apnea
373	374	Female	59	Nurse	8.1	9	75	3	Overweight	140/95	68	7000	Sleep Apnea

374 rows × 13 columns

Check data quality#

Missing values#

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374 entries, 0 to 373
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Person ID                374 non-null    int64  
 1   Gender                   374 non-null    object 
 2   Age                      374 non-null    int64  
 3   Occupation               374 non-null    object 
 4   Sleep Duration           374 non-null    float64
 5   Quality of Sleep         374 non-null    int64  
 6   Physical Activity Level  374 non-null    int64  
 7   Stress Level             374 non-null    int64  
 8   BMI Category             374 non-null    object 
 9   Blood Pressure           374 non-null    object 
 10  Heart Rate               374 non-null    int64  
 11  Daily Steps              374 non-null    int64  
 12  Sleep Disorder           155 non-null    object 
dtypes: float64(1), int64(7), object(5)
memory usage: 38.1+ KB

I will impute the missing values in the “Sleep Disorder” column with the text value “No”.
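A minimal sketch of this imputation with pandas. The small frame `toy` is a made-up stand-in for the real dataset, so its values are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "Person ID": [1, 2, 3, 4],
    "Sleep Disorder": [np.nan, "Sleep Apnea", np.nan, "Insomnia"],
})

# Replace missing values with the text label "No"
toy["Sleep Disorder"] = toy["Sleep Disorder"].fillna("No")

print(toy["Sleep Disorder"].value_counts())
```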

Duplicated rows#

Duplicated rows -> 0
Duplicated 'Person ID' -> 0
Duplicated except on 'Person ID' -> 242

I will assume that the duplicated rows do not correspond to different people (identical values across so many variables would be too unlikely) and that they come from an error while recording the data, so I will drop them, keeping only the first occurrence of each.

I will also drop the “Person ID” column because it has no value for the purpose of the analysis.

	Gender	Age	Occupation	Sleep Duration	Quality of Sleep	Physical Activity Level	Stress Level	BMI Category	Blood Pressure	Heart Rate	Daily Steps	Sleep Disorder
0	Male	27	Software Engineer	6.1	6	42	6	Overweight	126/83	77	4200	No
1	Male	28	Doctor	6.2	6	60	8	Normal	125/80	75	10000	No
3	Male	28	Sales Representative	5.9	4	30	8	Obese	140/90	85	3000	Sleep Apnea
5	Male	28	Software Engineer	5.9	4	30	8	Obese	140/90	85	3000	Insomnia
6	Male	29	Teacher	6.3	6	40	7	Obese	140/90	82	3500	Insomnia
...	...	...	...	...	...	...	...	...	...	...	...	...
358	Female	59	Nurse	8.0	9	75	3	Overweight	140/95	68	7000	No
359	Female	59	Nurse	8.1	9	75	3	Overweight	140/95	68	7000	No
360	Female	59	Nurse	8.2	9	75	3	Overweight	140/95	68	7000	Sleep Apnea
364	Female	59	Nurse	8.0	9	75	3	Overweight	140/95	68	7000	Sleep Apnea
366	Female	59	Nurse	8.1	9	75	3	Overweight	140/95	68	7000	Sleep Apnea

132 rows × 12 columns
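The three duplicate checks and the clean-up can be sketched as follows. The frame `toy` and its values are hypothetical; only the column names come from the dataset. Keeping the first occurrence (the pandas default) is consistent with the row indices that survive in the table above.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "Person ID": [1, 2, 3, 4, 5],
    "Gender": ["Male", "Male", "Male", "Female", "Female"],
    "Age": [28, 28, 28, 59, 59],
    "Sleep Duration": [6.2, 6.2, 5.9, 8.1, 8.1],
})

# Fully duplicated rows, and duplicated identifiers
n_dup_rows = toy.duplicated().sum()
n_dup_ids = toy["Person ID"].duplicated().sum()

# Rows that repeat an earlier row once "Person ID" is ignored
features = toy.drop(columns="Person ID")
n_dup_features = features.duplicated().sum()

# Keep the first occurrence of each distinct record
deduped = features.drop_duplicates()

print(n_dup_rows, n_dup_ids, n_dup_features, len(deduped))
```

Note that `drop_duplicates` keeps the original index labels, which is why the cleaned table has gaps in its index.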

Check value ranges#

Categorical columns#

Let’s see if variables of type ‘object’ (strings) contain categories.

../_images/ad798e4f4cbbe681e3362dd19ed455b9d1a531af5c8ba22af1980379bc17e4fa.png

I will proceed with the following adjustments:

In “BMI Category”, “Normal Weight” values can be merged into “Normal”.
In the “Occupation” column, the “Sales Representative” category can be merged into “Salesperson”.
In the “Blood Pressure” column, “Systolic/Diastolic” pairs can be categorized into blood pressure levels.

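# Imports needed by this cell
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Merge equivalent categories
sleep = sleep.replace({"Sales Representative": "Salesperson", "Normal Weight": "Normal"})

# Split the "Blood Pressure" column into two separate columns (systolic and diastolic)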
sleep[["Systolic", "Diastolic"]] = sleep["Blood Pressure"].str.split('/', expand=True)

# Convert the new columns to integers (they are strings after splitting)
sleep["Systolic"] = sleep["Systolic"].astype(int)
sleep["Diastolic"] = sleep["Diastolic"].astype(int)

# Define function to classify blood pressure level
def classify_blood_pressure(systolic, diastolic):
    if systolic < 90 or diastolic < 60:
        return "Low"
    # Check the crisis range before the hypertension ranges; otherwise it is never reached
    elif systolic > 180 or diastolic > 120:
        return "Hypertensive Crisis"
    elif systolic < 120 and diastolic < 80:
        return "Normal"
    elif 120 <= systolic < 130 and diastolic < 80:
        return "Elevated"
    elif 130 <= systolic < 140 or 80 <= diastolic < 90:
        return "Hypertension"
    elif systolic >= 140 or diastolic >= 90:
        return "High Hypertension"
    else:
        return "Unclassified"

# Apply the classify_blood_pressure function to create the "Blood Pressure Level" column
sleep["Blood Pressure Level"] = sleep.apply(lambda row: classify_blood_pressure(row["Systolic"], row["Diastolic"]), axis=1)

# Drop columns not being used further
sleep = sleep.drop(["Blood Pressure", "Systolic", "Diastolic"], axis=1)

# # Convert columns of "object" type to category datatype
# sleep = sleep.astype({column : "category" for column in sleep.select_dtypes(include="object").columns})

# Convert columns to ordered category datatype (values not listed in `labels` would become NaN)
labels = ["Normal", "Elevated", "Hypertension", "High Hypertension"]
sleep["Blood Pressure Level"] = sleep["Blood Pressure Level"].astype(pd.CategoricalDtype(categories=labels, ordered=True))

labels = ["Normal", "Overweight", "Obese"]
sleep["BMI Category"] = sleep["BMI Category"].astype(pd.CategoricalDtype(categories=labels, ordered=True))

labels = ["No", "Sleep Apnea", "Insomnia"]
sleep["Sleep Disorder"] = sleep["Sleep Disorder"].astype(pd.CategoricalDtype(categories=labels, ordered=True))

# Convert the rest to unordered category
sleep = sleep.astype({column : "category" for column in ["Gender", "Occupation"]})

# Plot "Blood Pressure Level" to check the categories it contains
fig, ax = plt.subplots(figsize=(4, 3))
sleep["Blood Pressure Level"].value_counts().plot(ax=ax, kind="bar")
sns.despine()
plt.show()

../_images/29cd6dc8319c5a0302e6f364c443ca3fc3e60fcd2ac9e8a9f21849cb4b09fd2a.png

Numerical columns#

../_images/6b28cfe8444dccd5295b413715e98249f23c73b8b482b5608988bc97ff3b7d45.png

I will establish categorical age intervals for the purpose of the analysis. Considering available ages, I choose:

[27, 38) as “Young Adult”
[38, 49) as “Middle-aged”
[49, 60) as “Older Adult”

../_images/8b8dae8b42497c0107f4e702487c465692034323abcaaf03301776d4d5ad0586.png
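One way to build such left-closed age intervals is `pd.cut` with `right=False`. This is a sketch on a few hypothetical ages, not necessarily the code used to produce the chart above:

```python
import pandas as pd

# Hypothetical ages chosen to sit on the interval boundaries
ages = pd.Series([27, 37, 38, 48, 49, 59], name="Age")

# Left-closed intervals [27, 38), [38, 49), [49, 60)
age_group = pd.cut(
    ages,
    bins=[27, 38, 49, 60],
    labels=["Young Adult", "Middle-aged", "Older Adult"],
    right=False,
)

print(age_group.tolist())
```

`pd.cut` returns an ordered categorical, so the groups keep their natural order in plots and group-by summaries.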

I will also categorize Physical Activity Level (minutes/day) as:

Low Activity: up to 30 minutes
Medium Activity: more than 30 and up to 60 minutes
High Activity: more than 60 minutes

../_images/9d66836bc846d6df65c3c2361bf87439766bbc3cfdac75ab4cad95a536411471.png
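The activity groups can be derived in the same spirit with `pd.cut`. The bin edges below are an assumption, chosen so that 30 minutes falls in Low Activity and 60 minutes in Medium Activity, which matches the rows shown in the data analysis table further down:

```python
import numpy as np
import pandas as pd

# Hypothetical values, including the two boundary cases (30 and 60)
minutes = pd.Series([30, 42, 60, 75], name="Physical Activity Level")

# Right-closed intervals: [0, 30], (30, 60], (60, inf)
activity_group = pd.cut(
    minutes,
    bins=[0, 30, 60, np.inf],
    labels=["Low Activity", "Medium Activity", "High Activity"],
    include_lowest=True,
)

print(activity_group.tolist())
```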

Data analysis#

The dataframe to be analyzed looks like this:

	Gender	Age	Occupation	Sleep Duration	Quality of Sleep	Physical Activity Level	Stress Level	BMI Category	Heart Rate	Daily Steps	Sleep Disorder	Blood Pressure Level	Age Group	Physical Activity Group
0	Male	27	Software Engineer	6.1	6	42	6	Overweight	77	4200	No	Hypertension	Young Adult	Medium Activity
1	Male	28	Doctor	6.2	6	60	8	Normal	75	10000	No	Hypertension	Young Adult	Medium Activity
3	Male	28	Salesperson	5.9	4	30	8	Obese	85	3000	Sleep Apnea	High Hypertension	Young Adult	Low Activity
5	Male	28	Software Engineer	5.9	4	30	8	Obese	85	3000	Insomnia	High Hypertension	Young Adult	Low Activity
6	Male	29	Teacher	6.3	6	40	7	Obese	82	3500	Insomnia	High Hypertension	Young Adult	Medium Activity
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
358	Female	59	Nurse	8.0	9	75	3	Overweight	68	7000	No	High Hypertension	Older Adult	High Activity
359	Female	59	Nurse	8.1	9	75	3	Overweight	68	7000	No	High Hypertension	Older Adult	High Activity
360	Female	59	Nurse	8.2	9	75	3	Overweight	68	7000	Sleep Apnea	High Hypertension	Older Adult	High Activity
364	Female	59	Nurse	8.0	9	75	3	Overweight	68	7000	Sleep Apnea	High Hypertension	Older Adult	High Activity
366	Female	59	Nurse	8.1	9	75	3	Overweight	68	7000	Sleep Apnea	High Hypertension	Older Adult	High Activity

132 rows × 14 columns

Correlations#

Correlation coefficients between numerical columns:

../_images/5f8e0153dd506fef8c00a41300a198100398beae68e777214c2c1048a48373e3.png

Here are the strong correlations:

Quality of Sleep (subjective) is strongly correlated with Sleep Duration (more objective), so either of them can serve as a reference to evaluate sleep.
Stress Level is strongly negatively correlated with Quality of Sleep (and therefore with Sleep Duration).

Some milder correlations:

Daily Steps is moderately correlated with Physical Activity Level, but not strongly (they do not measure the same thing).
Higher Heart Rate tends to go with lower Quality of Sleep.
Stress Level is related to some extent to Heart Rate.
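A correlation matrix like the one discussed above can be computed as sketched here. The values in `toy` are made up, and the use of Pearson coefficients (the pandas default) is an assumption, since the method behind the figure is not stated:

```python
import pandas as pd

# Toy stand-in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "Sleep Duration": [6.1, 6.2, 5.9, 7.5, 8.1],
    "Quality of Sleep": [6, 6, 4, 8, 9],
    "Stress Level": [6, 8, 8, 4, 3],
    "Occupation": ["Engineer", "Doctor", "Salesperson", "Teacher", "Nurse"],
})

# Pearson correlation between the numerical columns only
corr = toy.select_dtypes(include="number").corr()

print(corr.round(2))
```

The resulting matrix can then be passed to a heatmap function such as seaborn's `heatmap(corr, annot=True)` for display.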

Q: Who sleeps better, women or men?#

../_images/e1c8c8a45383a876e528fb083a6a6012160aceed76ce82095b94f4c4c6e0ac14.png

On the left graph we can see that female individuals report a better subjective sleep quality than males, and the non-overlapping confidence intervals suggest that this could be a robust pattern.

On the right graph, though, the confidence intervals for the number of hours overlap, which indicates that sleep durations in the two populations may not be so different: we cannot say confidently that women sleep more hours than men.
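To make the idea of comparing confidence intervals concrete, here is a sketch of a percentile bootstrap interval for a group mean. The helper `bootstrap_ci` and the ratings are hypothetical; the charts above may have been produced with a different method.

```python
import numpy as np

def bootstrap_ci(values, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement and record the mean of each resample
    samples = rng.choice(values, size=(n_boot, values.size), replace=True)
    means = samples.mean(axis=1)
    alpha = (1 - level) / 2
    low, high = np.quantile(means, [alpha, 1 - alpha])
    return low, high

# Illustrative ratings only, not taken from the dataset
women = [8, 9, 7, 8, 9, 8, 7, 9]
men = [6, 6, 5, 7, 6, 5, 6, 7]

ci_women = bootstrap_ci(women)
ci_men = bootstrap_ci(men)

# Intervals are disjoint when one ends before the other begins
disjoint = ci_women[0] > ci_men[1] or ci_men[0] > ci_women[1]
print(ci_women, ci_men, disjoint)
```

Disjoint intervals are reasonable evidence of a difference between group means. Overlapping intervals, however, do not by themselves prove that there is no difference, so a formal test would be needed to settle borderline cases.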

Q: How does age relate to sleep quality?#

../_images/fcb0dd406f8ab34de4029e8f0e2d3299e618ce8dc6ff4d7ec7c147bddc398c88.png

As we age, the sleep experience seems to improve, especially for Older Adults, whose confidence interval does not overlap with those of the other groups. But…

../_images/20e89cc25896ff462f5fe21e48cc417925e5fbc9ad51a1fac0f14d56b10a2ddb.png

…there is only one male in the Older Adult group! This surely affects the outcome for that group, which is formed mainly by females who, as we saw, report a better sleep experience.

So we cannot say that Older Adults in general (both men and women) sleep better. We would need more male samples to assess that.
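A quick way to spot this kind of imbalance before reading a chart is a cross-tabulation of group sizes. The rows in `toy` are hypothetical; only the column names follow the dataset.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Female", "Male"],
    "Age Group": ["Young Adult", "Middle-aged", "Older Adult",
                  "Older Adult", "Middle-aged", "Older Adult"],
})

# Number of individuals in each Age Group x Gender cell
counts = pd.crosstab(toy["Age Group"], toy["Gender"])

print(counts)
```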

Q: Does physical activity improve sleep?#

../_images/67496990005319922d8a2d4907d97f6ef4029170615234877e59b7e78d8f1ce2.png

It looks like High Activity improves sleep quality, at least relative to Medium Activity. The Low Activity group shows a lot of variance, mainly caused by the discrepancy between the results for men and women: women with low activity sleep much better than men with low activity. However, the number of samples in this group is quite limited compared to the other groups, so it is risky to extrapolate.
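Reporting the group size next to each group mean helps to judge how much weight a comparison deserves. A sketch with made-up values (the frame `toy` is hypothetical):

```python
import pandas as pd

# Toy stand-in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female", "Male", "Female", "Male"],
    "Physical Activity Group": ["Low Activity", "Low Activity", "Low Activity",
                                "Medium Activity", "Medium Activity",
                                "High Activity", "High Activity"],
    "Quality of Sleep": [8, 4, 5, 7, 6, 9, 8],
})

# Mean sleep quality and number of samples per activity group and gender
summary = (
    toy.groupby(["Physical Activity Group", "Gender"])["Quality of Sleep"]
    .agg(["mean", "count"])
)

print(summary)
```

Cells with a very small `count` are the ones where a difference in means should be read with caution.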

Q: How do different health conditions affect sleep quality?#

../_images/678d3aecf931220999777cb14c92765f509e66553e4f919894efab7493549fa3.png

There is a lot of uncertainty for the Obese group because of the scarcity of samples and the gender imbalance. For the other groups, Normal-weight people tend to report better sleep quality than Overweight people, also when men and women are compared separately.

../_images/c986a05f28112ae71f5851245706fa61278ff8a0b60032c05b21e6505dcd576f.png

There are not many samples in the Elevated blood pressure level, nor in the Normal and High Hypertension groups for males, so there are no clear outcomes here. In any case, the Normal group tends to have better sleep quality than the Hypertension group.

../_images/b3895b07b3fd7f2255b240eba64d0628e1640775e307171206016283bc16a2f4.png

From these charts we conclude that Insomnia clearly prevents a good night’s sleep. As for Sleep Apnea, it does not seem to lower the sleep quality of female individuals as much as that of males.

Q: How do occupations relate to sleep quality?#

../_images/0c6f1fe1fbc03571de18fa6a75160b46f5738bd74bb9437f7359bf447d369d73.png

With as many as 10 categories in Occupation, some are underrepresented, especially when split by gender, so conclusions are harder to draw. For example, nurses have a relatively high sleep quality, but there are no male nurses and, since females report better sleep, this outcome could be related to gender more than to occupation.

Conclusions#

The quality of sleep is strongly associated with its duration.
The level of stress is a strong (negative) predictor of sleep quality.
Women report a better subjective sleep quality than men, even though it is not clear that women sleep more hours than men.
Physical Activity of more than 1 hour daily seems to improve sleep quality.
Among people with Low Physical Activity (30 min), women sleep much better than men, although the samples are few.
Overweight people tend to sleep worse than Normal-weight people.
Insomnia clearly prevents a good night’s sleep.
Sleep Apnea seems to impair sleep quality more for men than for women.
When it comes to Occupation, the number of samples and the classes are quite imbalanced, so results here depend very much on the quantity of records.